Preface



This volume contains revised versions of selected papers presented during 
the 28th Annual Conference of the Gesellschaft für Klassifikation (GfKl), the
German Classification Society. The conference was held at the Universität
Dortmund in Dortmund, Germany, in March 2004. Wolfgang Gaul chaired
the program committee; Claus Weihs and Ernst-Erich Doberkat were the
local organizers. Patrick Groenen, Iven van Mechelen, and their colleagues
of the Vereniging voor Ordinatie en Classificatie (VOC), the Dutch-Flemish 
Classification Society, organized special VOC sessions. 

The program committee recruited 17 notable and internationally
renowned invited speakers for plenary and semi-plenary talks on their current
research work regarding classification and data analysis methods as well as
applications. In addition, 172 invited and contributed papers by authors from 18
countries were presented at the conference in 52 parallel sessions representing
the whole field addressed by the title of the conference “Classification: The
Ubiquitous Challenge”. Among these 52 sessions the VOC organized sessions
on Mixture Modelling, Optimal Scaling, Multiway Methods, and Psychomet- 
rics with 18 papers. Overall, the conference, which is traditionally designed as 
an interdisciplinary event, again provided an attractive forum for discussions 
and mutual exchange of knowledge. 

Besides the results obtained in the fundamental subjects Classification
and Data Analysis, the talks in the applied areas focused on various
application topics. Moreover, along with the conference a competition on “Social
Milieus in Dortmund”, co-organized by the city of Dortmund, took place.
Hence the presentation of the papers in this volume is arranged in the
following parts:

I. (Semi-)Plenary Presentations 

II. Classification and Data Analysis 

III. Applications, and 

IV. Contest: Social Milieus in Dortmund. 

The part on applications has sub-chapters according to the different applica- 
tion fields Archaeology, Astronomy, Bio-Sciences, Electronic Data and Web, 
Finance and Insurance, Library Science and Linguistics, Macro-Economics, 
Marketing, Music Science, and Quality Assurance. Within (sub-)parts
papers are mainly arranged in alphabetical order with respect to (first) authors’
names.



I. 

Plenary and semi-plenary lectures include both conceptual and applied
papers. Among the conceptual papers Erosheva and Fienberg present a fully
Bayesian approach to soft clustering and classification within a general
framework of mixed membership, Friendly introduces the Milestones Project on
documentation and illustration of historical developments in statistical graph- 
ics, Hornik discusses consensus partitions particularly when applied to ana- 
lyze the structure of cluster ensembles, Kiers gives an overview of procedures 
for constructing bootstrap confidence intervals for the solutions of three-way 
component analysis techniques, Pahl argues that a classification framework 
can organize knowledge about software components’ characteristics, and Uter 
and Gefeller define partial attributable risk as a unique solution for allocating 
shares of attributable risk to risk factors. Within the applied papers Beran 
presents preprocessing of musical data utilizing prior knowledge from musicol- 
ogy, Fischer et al. introduce a method for the prediction of spatial properties 
of molecules from the sequence of amino acids incorporating biological back- 
ground knowledge, Grzybek et al. discuss how far word length may contribute 
to quantitative typology of texts, and Snoek and Worring present the Time 
Interval Multimedia Event framework as a robust approach for classification 
of semantic events in multimodal soccer video. 

II. 

The second part of this volume is concerned with methodological progress
in classification and data analysis; the methods presented cover a variety of
different aspects.

In the Classification part, more precise confidence intervals for the pa- 
rameters of latent class models using the bootstrap method are proposed 
(Dias), as well as a method of feature selection for ensembles that signif- 
icantly reduces the dimensionality of subspaces (Gatnar), and a sensitive 
two-stage classification system for the detection of events in spite of a noisy 
background in the processing of thousands of images in a few seconds (Hader 
and Hamprecht). Variants of bagging and boosting that make use of an
ordinal response structure are discussed (Hechenbichler and Tutz), a methodology
for exploring two quality aspects of cluster analyses, namely separation and
homogeneity of clusters, is presented (Hennig), and a comparison of Adaboost to
Arc-x(h) for different values of h in the subsampling of binary classification
data is carried out (Khanchel and Limam). The method of distance-based discriminant
analysis (DDA) is introduced finding a linear transformation that optimizes 
an asymmetric data separability criterion via iterative majorization and the 
necessary number of discriminative dimensions (Kosinov et al.), an efficient 
hybrid methodology to obtain CHAID tree segments based on multiple de- 
pendent variables of possibly different scale types is proposed (Magidson and 
Vermunt), and possibilities of defining the expectation of p-dimensional
intervals (Nordhoff) are described. Design of experiments is introduced into
variable selection in classification (Pumplün et al.), as well as the KMC/EDAM
method for classification and visualization as an alternative to Kohonen
Self-Organizing Maps (Raabe et al.). A clustering of variables approach extended
to situations with missing data based on different imputation methods
(Sahmer et al.), a method for binary online-classification incorporating temporally
distributed information (Schäfer et al.), and a concept of characteristic
regions and a new method, called DiSCo, to simultaneously classify and
visualize data (Szepannek and Luebke) are described. The part concludes with
two papers discussing multivariate Pareto Density Estimation (PDE), based 
on information optimality, for data sets containing clusters (Ultsch) and an 
extension of standard latent class or mixture models that can be used for the 
analysis of multilevel and repeated measures data (Vermunt and Magidson).

The part on Data Analysis starts with papers proposing a robust pro- 
cedure for estimating a covariance matrix under conditional independence 
restrictions in graphical modelling (Becker) and a new approach to find prin- 
cipal curves through a multidimensional, possibly branched, data cloud (Ein- 
beck et al.). A three-way multidimensional scaling approach developed to 
account for individual differences in the judgments about objects, persons or 
brands (Krolak-Schwerdt), and the Time Series Knowledge Mining (TSKM)
framework to discover temporal structures in multivariate time series based
on the Unification-based Temporal Grammar (UTG) (Mörchen and Ultsch)
are introduced. A framework for the comparison of the information in contin- 
uous and categorical data (Nishisato) and an external analysis of two-mode 
three-way asymmetric multidimensional scaling for the disclosure of asymme- 
try (Okada and Imaizumi) are presented. Finally, nonparametric regression 
with the Relevance Vector Machine under inclusion of covariate measurement 
error (Rummel) is described. 



III. 

In the third part of this volume all contributions are likewise related to
applications of classification and data analysis methods, but are arranged by
application field.

Two papers deal with applications in Archaeology. The first is a
historical overview (Ihm) of early publications about formal methods for the
seriation of archaeological finds; in the second article some cluster analysis
models including different data transformations are investigated in order to
differentiate between brickyards of different areas on the basis of chemical
analysis (Mucha et al.).

Another two papers (both by Bailer-Jones) discuss applications in
Astronomy. A brief overview of the upcoming Gaia astronomical survey
mission, a major European project to map and classify over a billion stars in our
Galaxy, and an outline of the challenges are given in the first paper while in 
the second a novel method based on evolutionary algorithms for designing 
filter systems for astronomical surveys in order to provide optimal data on 
stars and to determine their physical parameters is introduced. 

The articles with applications in the Bio-Sciences all deal with enzyme,
DNA, microarray, or protein data, except the presentation of results of a
systematic and quantitative comparison of pattern recognition methods in the
analysis of clinical magnetic resonance spectra applied to the detection of 
brain tumor (Menze et al.). The Generative Topographic Mapping approach 
as an alternative to SOM for the analysis of microarray data (Grimmenstein 
et al.) and a finite conservative test for detecting a change point in a bi- 
nary sequence with Markov dependence and applications in DNA analysis 
(Krauth) are proposed as well as a new algorithm for finding similar sub- 
structures in enzyme active sites with the use of emergent self-organizing 
neural networks (Kupas and Ultsch). How the feature selection procedure 
“Significance Analysis of Microarrays” (SAM) and the classification method 
“Prediction Analysis of Microarrays” (PAM) can be applied to “Single Nu- 
cleotide Polymorphism” (SNP) data is explained (Schwender) as well as that 
using relative differences (RelDiff) instead of LogRatios for cDNA microarray 
analysis solves several problems like unlimited ranges, numerical instability 
and rounding errors (Ultsch). Finally, a novel method, PhyNav, to reconstruct
the evolutionary relationship from very large DNA and protein datasets is
introduced applying the maximum likelihood principle (Vinh et al.).

Among the contributions on applications to Electronic Data and Web 
one paper discusses the application of clustering with restricted random walks 
on library usage histories in large document sets containing millions of objects 
(Franke and Thede). In the other four papers different aspects of web mining
are tackled. A tool is described assisting users of online news web-sites in 
order to reduce information overload (Bomhardt and Gaul), benchmarks are 
offered with respect to competition and visibility indices as predictors for 
traffic in web-sites (Schmidt-Manz and Gaul), an algorithm is introduced for 
fuzzy two-mode clustering that outperforms collaborative filtering (Schlecht 
and Gaul), and visualizations of online search queries are compared to im- 
prove understanding of searching, viewing, and buying behavior of online 
shoppers and to further improve the generation of recommendations (Thoma 
and Gaul). 

Two of the articles on Finance and Insurance deal with insurance 
problems: A strategy based on a combination of support vector regression 
and kernel logistic regression to detect and to model high-dimensional de- 
pendency structures in car insurance data sets is proposed (Christmann) and 
support vector machines are compared to traditional statistical classification 
procedures in a life insurance environment (Steel and Hechter). Applications 
in Finance deal with evaluation of global and local statistical models for 
complex data sets of credit risks with respect to practical constraints and 
asymmetric cost functions (Schwarz and Arminger), show how linear sup- 
port vector machines select informative patterns from a credit scoring data 
pool serving as inputs for traditional methods more familiar to practitioners 
(Stecking and Schebesch), analyze the question of risk budgeting in
continuous time (Straßberger), and formulate a one-factor model for the
correlation between probabilities of default across industry branches, comparing it
to more traditional methods on the basis of insolvency rates for Germany
(Weißbach and Rosenow).

Besides one contribution on Library Science where it is argued that the
history of classification is closely linked to the history of library science
(Lorenz), the volume includes five papers on applications in Linguistics.
It is shown that one meta-linguistic relation suffices to model the concept 
structure of the lexicon making use of intensional logic (Bagheri), that
improvements of the morphological segmentation of words using classical
distributional methods are possible (Benden), and that in Russian texts (letters
and poems by three different authors) word length is a characteristic of genre, 
rather than of authorship (Kelih et al.). A validation method of cluster analy- 
sis methods concerning the number and stability of clusters is described with 
the help of an application in linguistics (Mucha and Haimerl), clustering of 
word contexts is used in a large collection of texts for word sense induction, 
i.e. automatic discovery of the possible senses for a given ambiguous word 
(Rapp), and formal graphs that structure a document-related information 
space by using a natural language processing chain and a wrapping proce- 
dure are proposed (Rist). 

There are three papers with applications in Macro-Economics, two of 
them dealing with the comparison of economic structures of different coun- 
tries. The sensitivity of economic rankings of countries based on indicator 
variables is discussed (Berrer et al.), structural variables of the 25 member 
European Union are analyzed and patterns are found to be quite different 
between the 15 current and the 10 new members (Sell), while the question 
whether methods measuring (relative) importance of variables in the context 
of classification allow interpretation of individual effects of highly correlated 
economic predictors for the German business cycle (Enache and Weihs) is 
tackled in a more methods-based contribution. 

Within the Marketing applications one article shows by means of an 
intercultural survey (Bauer et al.) that the cyber community is not a homo- 
geneous group since online consumers can be classified into the three clusters: 
“risk averse doubters”, “open minded online-shoppers” and “reserved
information seekers”. Two papers deal with reservation prices. A novel estimation
procedure of reservation prices combining adaptive conjoint analysis with 
a choice task using individually adapted price scales is proposed (Breidert 
et al.), and an explicit evaluation of variants of conjoint analysis together 
with two types of data collection is described for the detection of reservation 
prices of product bundles applied to a seat system offered by a German car 
manufacturer (Stauß and Gaul).

Music Science is an application field that is present at GfKl conferences 
for the first time. In this volume one paper deals with time series analysis, the 
other five papers apply classification methods. A new algorithm structure is 
introduced for feature extraction from time series, and its efficiency is proved and
illustrated by different classification tasks for audio data (Mierswa).
Classification methods are used to show that the more unstable the musical sound is
in the time domain, the more pitch bending is permitted to the musician expressing
emotions by music (Fricke). Classification rules for quality classes of “sight 
reading” (SR) are derived (Kopiez et al.) based on indicators of piano prac- 
tice, mental speed, working memory, inner hearing etc. as well as the total SR 
performance of 52 piano students. Classification rules are also found for dig- 
itized sounds played by different instruments based on the Hough-transform 
(Röver et al.). Finally, classifications of possibly overlapping drum sounds
by linear support vector machines (Van Steelant et al.) and of singers and 
instruments into high or low musical registers only by means of timbre, i.e. 
after elimination of pitch information, are proposed (Weihs et al.). 

Applications in Quality Assurance include one methodological paper 
(Jessenberger and Weihs) which proposes the use of the expected value of the 
so-called desirability function to assess the capability of a process. The other 
papers discuss different statistical aspects of a deep hole drilling process in 
machine building. The Lyapunov exponent is used for the discrimination
between well-predictable and not well-predictable time series with applications
in quality control (Busse). Two multivariate control charts to monitor the 
drilling process in order to prevent chatter vibrations and to secure produc- 
tion with high quality are proposed (Messaoud et al.) as well as a procedure 
to assess the changing amplitudes of relevant frequencies over time based on 
the distribution of periodogram ordinates (Theis and Weihs).

IV. 

The fourth part of this volume starts with an introduction to the
competition on “Social Milieus in Dortmund” (Sommerer and Weihs). Moreover, the
best three papers of the competition by Scheid, by Schäfer and Lemm, and
by Röver and Szepannek appear in this volume. We would like to thank the
head of the “dortmund-project”, Udo Mager, and the head of the Fachbereich
“Statistik und Wahlen” of the City of Dortmund, Ernst-Otto Sommerer, for
their kind support.

The conference owed much to its sponsors (in alphabetical order)

• Deutsche Forschungsgemeinschaft (DFG), Bonn,
• dortmund-project, Dortmund,
• Fachbereich Statistik, Universität Dortmund, Dortmund,
• Landesbeauftragter für die Beziehungen zwischen den Hochschulen in NRW und den Beneluxstaaten,
• Novartis, Basel, Switzerland,
• Roche Diagnostics, Penzberg,
• sas Deutschland, Heidelberg,
• Sonderforschungsbereich 475, Dortmund,
• Springer-Verlag, Heidelberg,
• Universität Dortmund, and
• John Wiley and Sons, Chichester, UK,

who helped in many ways. Their generous support is gratefully acknowledged.

Additionally, we wish to express our gratitude to the authors of the pa- 
pers in the present volume, not only for their contributions, but also for their 
diligence and timely production of the final versions of their papers. Fur- 
thermore, we thank the reviewers for their careful reviews of the originally 
submitted papers, and in this way, for their support in selecting the best 
papers for this publication. 

We would like to emphasize the outstanding work of Uwe Ligges and Nils 
Raabe who did an excellent job in organizing the program of the confer- 
ence and the refereeing process as well as in preparing the abstract booklet 
and this volume, respectively. We also wish to thank our colleague Prof.
Dr. Ernst-Erich Doberkat, Fachbereich Informatik, University of Dortmund,
for co-organizing the conference, and the Fachbereich Statistik of the
University of Dortmund for all the support, in particular Anne Christmann, Dr.
Daniel Enache, Isabelle Grimmenstein, Dr. Sonja Kuhnt, Edelgard Kürbis,
Karsten Luebke, Dr. Constanze Pumplün, Oliver Sailer, Roland Schultze,
Sibylle Sturtz, Dr. Winfried Theis, Magdalena Thöne, and Dr. Heike
Trautmann as well as other members and students of the Fachbereich for helping to
organize the conference and making it a big success, and Alla
Stankjawitschene and Dr. Stefan Dißmann from the Fachbereich Informatik for all they did
in organizing all financial affairs.

Finally, we want to thank Christiane Beisel and Dr. Martina Bihn of
Springer-Verlag, Heidelberg, for their support and dedication to the
production of this volume.

Classification and Data Mining in Musicology



Jan Beran 

Department of Mathematics and Statistics, 
University of Konstanz, 78457 Konstanz, Germany 



Abstract. Data in music are complex and highly structured. In this talk a number 
of descriptive and model-based methods are discussed that can be used as pre- 
processing devices before standard methods of classification, clustering etc. can 
be applied. The purpose of pre-processing is to incorporate prior knowledge in 
musicology and hence to filter out information that is relevant from the point of 
view of music theory. This is illustrated by a number of examples from classical 
music, including the analysis of scores and of musical performance. 



1 Introduction 

Mathematical considerations in music have a long tradition. The most ob- 
vious connection between mathematics and music is through physics. For 
instance, in ancient Greece, the Pythagoreans discovered the musical signif- 
icance of simple frequency ratios such as 2/1 (octave), 3/2 (pure fifth), 4/3 
(pure fourth) etc., and their relation to the length of a string. There are, how- 
ever, deeper connections between mathematical and musical structures that 
go far beyond acoustics. Many of these can be discovered using techniques 
from data mining, together with a priori knowledge from music theory. The 
results can be used, for instance, to solve classification problems. This is 
illustrated in the following sections by three types of examples. 



2 Music, $1/f$-noise, fractal and chaos

In their celebrated, but also controversial, paper, Voss and Clarke (1975)
postulated that recorded music is essentially $1/f$-noise (in the spectral
domain), after high frequencies have been eliminated. (The term $1/f$-noise is
generally used for random processes whose power spectrum is dominated by
low frequencies $f$ such that its value is proportional to $1/f$.) Can we verify
this statement? At first, the following question needs to be asked: Which 
aspects of a composition does recorded music represent? Sound waves are 
determined not only by the selection of notes, but also by the instrumental 
sound itself. It turns out, however, that the sound wave of a musical
instrument often resembles $1/f$-noise (see e.g. Beran (2003)). Thus, if recorded
music looks like $1/f$-noise, this may be due to the instrument rather than
a particular composition. To separate instrumental sounds from composed 
music, we therefore consider the score itself, in terms of pitch and onset
time. The problem of superposition of notes in polyphonic music is solved
by replacing chords by arpeggio chords, i.e. each chord is replaced by the
sequence of its notes, starting with the lowest note. In order to eliminate high
frequencies and to simplify the spectral density, data are aggregated by
taking averages over disjoint blocks of k = 7 notes (see Beran and Ocker (2001)
and Tsai and Chan (2004) for a theoretical justification). Subsequently, a
semiparametric fractional model with nonparametric trend function, the
so-called SEMIFAR-model (Beran and Feng (2002), also see Beran (1994)), is
fitted to the aggregated series. In a SEMIFAR-model, the stochastic part has 
a generalized spectral density behaving at the origin like $1/f^{\alpha}$ (where $f$ is
the frequency) with $\alpha = 2d$ for some $-1/2 < d$. Thus, $1/f$-noise corresponds
to $d = 1/2$. Figure 1 shows smoothed histograms of $\alpha$ for four different time
periods. The results are based on 60 compositions ranging from the 13th to 
the 20th century. Apparently a value around $\alpha = 1$ is favored in classical
music up to the early romantic period (first three distributions, from above). 
However, this preference is less clear in the late 19th and the 20th century. 
Similar investigations can be made for other characteristics of a
composition. For instance, we may consider onset time gaps between the occurrences
of a particular note. Figure 2 displays typical log-log-periodograms and fitted
spectra, for gap series referring to the most frequent note (modulo 12). Note
that, near zero, each fitted log-log-curve essentially behaves like a straight
line with estimated slope $-\alpha$.

In summary, we may say that $1/f^{\alpha}$-behaviour with $\alpha > 0$ appears to be
common for many musical parameters. The fractal parameter $\alpha = 2d$ may
be interpreted as a summary statistic of the degree of variation and memory.
From the examples here it is clear, however, that $1/f$-noise is not the only,
though perhaps the most frequent, type of variation. 
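
The preprocessing steps of this section (arpeggio replacement, block averaging, and reading the exponent off the log-log periodogram near zero) can be sketched as follows. This is only an illustration under stated assumptions: the score representation (one list of pitches per onset) and all function names are hypothetical, and the exponent is obtained by a plain least-squares fit to the log-log periodogram at the lowest frequencies, which is a much cruder device than the SEMIFAR fit used in the text.

```python
import numpy as np


def arpeggiate(chords):
    """Flatten a score into a single note sequence.

    `chords` is a hypothetical representation: one list of pitches per onset.
    Each chord is replaced by its notes in ascending order (lowest note first).
    """
    return [pitch for chord in chords for pitch in sorted(chord)]


def aggregate_blocks(x, k=7):
    """Average over disjoint blocks of k values; an incomplete last block is dropped."""
    x = np.asarray(x, dtype=float)
    n_blocks = len(x) // k
    return x[: n_blocks * k].reshape(n_blocks, k).mean(axis=1)


def estimate_alpha(x, frac=0.25):
    """Estimate alpha in a spectrum behaving like 1/f**alpha near the origin.

    Assumption: a straight line is fitted to the log-log periodogram over the
    lowest `frac` of the Fourier frequencies; alpha is minus its slope.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    dft = np.fft.rfft(x - x.mean())
    freqs = np.fft.rfftfreq(n)[1:]          # drop frequency zero
    pgram = np.abs(dft[1:]) ** 2 / (2 * np.pi * n)
    m = max(2, int(frac * len(freqs)))      # number of low frequencies used
    slope, _intercept = np.polyfit(np.log(freqs[:m]), np.log(pgram[:m]), 1)
    return -slope
```

For a score encoded this way, `estimate_alpha(aggregate_blocks(arpeggiate(chords)))` would give a rough value of $\alpha$; white noise should yield a value near 0 and a random walk a value near 2.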



3 Music and entropy 

The fractal parameter $d$ (or $\alpha = 2d$) is a measure of randomness and
coherence (memory) in the sense mentioned above. Another, in some sense more
direct, measure of randomness is entropy. Consider, for instance, the distri- 
bution of notes modulo 12 and its entropy. We calculate the entropy for 148 
compositions by the following composers: Anonymus (dates of birth between 
1200 and 1500), Halle (1240-1287), Ockeghem (1425-1495), Arcadelt (1505- 
1568), Palestrina (1525-1594), Byrd (1543-1623), Dowland (1562-1626), Has- 
sler (1564-1612), Schein (1586-1630), Purcell (1659-1695), D. Scarlatti (1660- 
1725), F. Couperin (1668-1733), Croft (1678-1727), Rameau (1683-1764), 
J.S. Bach (1685-1750), Campion (1686-1748), Haydn (1732-1809), Clementi 
(1752-1832), W.A. Mozart (1756-1791), Beethoven (1770-1827), Chopin 
(1810-1849), Schumann (1810-1856), Wagner (1813-1883), Brahms (1833- 
1897), Faure (1845-1924), Debussy (1862-1918), Scriabin (1872-1915),
Rachmaninoff (1873-1943), Schoenberg (1874-1951), Bartok (1881-1945), Webern

Fig. 1. Distribution of $-\alpha = -2d$ for four different time periods (first panel: up to 1700).

(1883-1945), Prokoffieff (1891-1953), Messiaen (1908-1992), Takemitsu (1930- 
1996) and Beran (*1959). For a detailed description of how the entropy is
calculated see Beran (2003). A plot of entropy against the date of birth of the
composer (figure 3) reveals a positive dependence, in particular after 1400. 
Why that is so can be seen, at least partially, from star plots of the distribu- 
tions. Figure 4 shows a random selection of star plots ranging from the 15th to 
the 20th century. In order to reveal more structure, the 12 note categories are 
ordered according to the ascending circle of fourths. The most striking feature 
is that for compositions that may be classified as purely tonal in a traditional 
sense, there is a neighborhood of 7 to 8 adjacent notes where beams are very 
long, and for the rest of the categories not much can be seen. The plausible 
reason is that in tonal music, the circle of fourths is a dominating feature 
that determines a lot of the structure. This is much less the case for classical 
music of the 20th century. With respect to entropy it means that for newer 
music, the (marginal) distribution of notes is much less predictable than in 
earlier music (see figure 3 where composers born after 1881 are marked as 
“20th century”, namely Prokoffieff, Messiaen, Takemitsu, Webern and Be- 
ran). Note, however, that there are also a few outliers in figure 3. Thus, the 
rule is not universal, and entropy may depend on the individual composer or
even the composition. In the last millennium, music moved gradually from
rather strict rules to increasing variety. It is therefore not surprising that
variability increases throughout the centuries: composers simply have more
choice. On the other hand, a comparison of Schumann’s entropies (which were
not included in figure 3) with those by Bach points in the opposite direction
(figure 5). As a cautionary remark it should also be noted that this data set
is a very small, and partially unbalanced, sample from the huge number of
existing compositions. For instance, Prokoffieff is included 15 times whereas
many other composers of the 20th century are missing. A more systematic
empirical investigation will need to be carried out to obtain more conclusive
results.

Fig. 2. Log-log-periodograms and fitted spectra for gap time series (panels: Bach, Prelude and Fugue, WK I, No. 17; Rameau, Le Tambourin; each showing the spectrum of aggregated gaps, d = 0.5).
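
The entropy of the note distribution modulo 12, and the reordering of the 12 categories along the circle of fourths used for the star plots, can be illustrated by a small sketch. It rests on assumptions that are not taken from the text: pitches are given as integers with pitch class 0 standing for C, and the entropy is the ordinary Shannon entropy with natural logarithm (the exact definition used in this section is the one in Beran (2003), which may differ in detail). All names are hypothetical.

```python
import math
from collections import Counter

# Pitch classes ordered along the ascending circle of fourths,
# assuming 0 = C: C, F, B flat, E flat, ...
CIRCLE_OF_FOURTHS = [(5 * i) % 12 for i in range(12)]


def pitch_class_distribution(pitches):
    """Relative frequencies of the 12 pitch classes (notes modulo 12)."""
    pitches = list(pitches)
    if not pitches:
        raise ValueError("at least one note is required")
    counts = Counter(p % 12 for p in pitches)
    return [counts.get(pc, 0) / len(pitches) for pc in range(12)]


def entropy(distribution):
    """Shannon entropy (natural logarithm); empty categories contribute zero."""
    return -sum(p * math.log(p) for p in distribution if p > 0)


def star_plot_order(distribution):
    """Reorder a 12-category distribution along the circle of fourths."""
    return [distribution[pc] for pc in CIRCLE_OF_FOURTHS]
```

Under this definition a piece that uses a single pitch class has entropy 0, while a piece that uses all 12 pitch classes equally often attains the maximum, log 12.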

4 Score information and performance 

Due to advances in music technology, performance theory is a very active 
area of research where statistical analysis plays an essential role. In contrast 
to some other branches of musicology, repeated observations and controlled 
experiments can be carried out. With respect to music where a score exists, 
the following question is essential: Which information is there in a score, 
and how can it be quantified? Beran and Mazzola (1999a) (also see
Mazzola (2002) and Beran (2003)) propose to encode structural information of a

Fig. 3. Entropy of notes in $\mathbb{Z}_{12}$ versus date of birth.



score by so-called metric, harmonic and melodic weights or indicators. These 
curves quantify the metric, harmonic and melodic importance of a note re- 
spectively. A modified motivic indicator based on a priori knowledge about 
motifs in the score is defined in Beran (2003). Figure 6 shows some indicator 
functions corresponding to eight different motifs in Schumann’s Träumerei.
These curves can be related to observed performance data by various sta- 
tistical methods (see e.g. Beran (2003), Beran and Mazzola (1999b, 2000, 
2001)). For instance, figure 7 displays tempo curves of different pianists after 
applying data sharpening with the indicator function of motif 2. Sharpening 
was done by considering only those onset times where the indicator curve 
of motif 2 is above its 90th percentile. This leads to simplified tempo curves 
where differences and commonalities are more visible. Also, sharpened tempo
curves can be used as input for other statistical techniques, such as
classification. A typical example is given in figure 8, where clustering is based on the
motif-2-sharpened tempo curves in figure 7.
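
The sharpening step described above (keeping only those onset times at which the indicator curve exceeds its 90th percentile) might look as follows in code. This is a sketch under assumptions: the arrays `onsets`, `tempo` and `indicator` are hypothetical inputs sampled at the same onset times, and `tempo` may hold one row per pianist.

```python
import numpy as np


def sharpen(onsets, tempo, indicator, q=90.0):
    """Keep only the onsets where `indicator` exceeds its q-th percentile.

    `tempo` may be a single curve (1-D) or one row per performance (2-D);
    its last axis must be aligned with `onsets` and `indicator`.
    """
    onsets = np.asarray(onsets, dtype=float)
    tempo = np.asarray(tempo, dtype=float)
    indicator = np.asarray(indicator, dtype=float)
    threshold = np.percentile(indicator, q)
    keep = indicator > threshold            # boolean mask over onset times
    return onsets[keep], tempo[..., keep]
```

The sharpened rows returned by `sharpen` could then serve as feature vectors for a clustering method of one's choice.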



Acknowledgements 

I would like to thank B. Repp for providing the tempo measurements.



Fig. 4. Star plots of $\mathbb{Z}_{12}$-distribution, ordered according to the circle of fourths (panel labels: Wagner, Debussy, Rameau, Scarlatti, Schumann, Scriabin, Bartok, Prokoffieff, Messiaen, Schoenberg, Webern, Takemitsu, Beran).




Fig. 5. Boxplots of entropies for Bach (left) and Schumann (right), based on note
distribution in $\mathbb{Z}_{12}$.




Fig. 6. Motivic indicators for Schumann’s Träumerei (horizontal axes: onset time).



Fig. 7. Schumann’s Träumerei: Tempo curves sharpened by 90th percentile of
motif-curve 2 (panels: ARGERICH, ARRAU, ASKENAZE, BRENDEL, BUNIN, CAPOVA,
CORTOT1, CORTOT2, CORTOT3, CURZON, DAVIES, DEMUS, ESCHENBACH, GIANOLI,
HOROWITZ1, HOROWITZ2, HOROWITZ3, KATSARIS, KLIEN, KRUST, KUBALEK,
MOISEIWITSCH, NEY, NOVAES, ORTIZ, SCHNABEL, SHELLEY, ZAK).










Fig. 8. Schumann’s Träumerei: Tempo clusters based on sharpened tempo (panel title: Motive-2-indicator: 90%-quantile-clustering).

Abstract. The paper describes and applies a fully Bayesian approach to soft
clustering and classification using mixed membership models. Our model structure
has assumptions on four levels: population, subject, latent variable, and sampling 
scheme. Population level assumptions describe the general structure of the popula- 
tion that is common to all subjects. Subject level assumptions specify the distribu- 
tion of observable responses given individual membership scores. Membership scores 
are usually unknown and hence we can also view them as latent variables, treating 
them as either fixed or random in the model. Finally, the last level of assumptions 
specifies the number of distinct observed characteristics and the number of replica- 
tions for each characteristic. We illustrate the flexibility and utility of the general 
model through two applications using data from: (i) the National Long Term Care 
Survey where we explore types of disability; (ii) abstracts and bibliographies from 
articles published in The Proceedings of the National Academy of Sciences. In the 
first application we use a Monte Carlo Markov chain implementation for sampling 
from the posterior distribution. In the second application, because of the size and 
complexity of the data base, we use a variational approximation to the posterior. 
We also include a guide to other applications of mixed membership modeling. 



1 Introduction 

The canonical clustering problem has traditionally had the following form: 
for N units or objects measured on J variables, organize the units into G 
groups, where the nature, size, and often the number of the groups is un- 
specified in advance. The classification problem has a similar form except 
that the nature and the number of groups are either known theoretically or 
inferred from units in a training data set with known group assignments. In 
machine learning, methods for clustering and classification are referred to 
as involving “unsupervised” and “supervised learning” respectively. Most of 
these methods assume that every unit belongs to exactly one group. In this 
paper, we will primarily focus on clustering, although methods described can 
be used for both clustering and classification problems.

Some of the most commonly used clustering methods are based on
hierarchical or agglomerative algorithms and do not employ distributional
assumptions. In model-based clustering, we let $x = (x_1, x_2, \ldots, x_J)$ be a
sample of $J$ characteristics from some underlying joint distribution,
$\Pr(x \mid \theta)$. Assuming each sample comes from one of $G$ groups, we estimate
$\Pr(x \mid \theta)$, which indicates the presence of groups or lack thereof. We
represent the distribution of the $g$th group by $\Pr_g(x \mid \theta)$ and then model
the observed data using the mixture distribution:

$$\Pr(x \mid \theta) = \sum_{g=1}^{G} \pi_g \Pr_g(x \mid \theta), \qquad (1)$$

with parameters $\{\theta, \pi_g\}$, and $G$.
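
As a minimal illustration of the mixture distribution just described, the sketch below evaluates a weighted sum of group densities. The normal component densities, the function names, and the two-group numbers are assumptions made for the example only; they do not come from the paper, which works with categorical responses.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, components):
    """Evaluate sum_g pi_g * Pr_g(x) for a finite mixture.

    weights    -- mixing proportions pi_g (non-negative, summing to 1)
    components -- list of (mean, sd) pairs, one per group (illustrative choice)
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixing proportions must sum to 1")
    return sum(w * normal_pdf(x, m, s) for w, (m, s) in zip(weights, components))

# Hypothetical two-group example: G = 2 with equal mixing proportions.
weights = [0.5, 0.5]
components = [(0.0, 1.0), (4.0, 1.0)]
density_at_midpoint = mixture_pdf(2.0, weights, components)
```

Each unit is still assumed to come from exactly one group here; the mixing proportions describe the population, not an individual's partial membership.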

The assumption that each object belongs exclusively to one of the G 
groups or latent classes may not hold, e.g., when characteristics sampled are 
individual genotypes, individual responses in an attitude survey, or words 
in a scientific article. In such cases, we say that objects or individuals have 
mixed membership and the problem involves soft clustering when the nature 
of groups is unknown or soft classification when the nature of groups is known 
through distributions $\Pr_g(x \mid \theta)$, $g = 1, \ldots, G$, specified in advance.

Mixed membership models have been proposed for applications in several 
diverse areas. We describe six of these here: 

1. NLTCS Disability Data. The National Long Term Care Survey assesses
disability in the U.S. elderly population. We have been working with a $2^{16}$
contingency table on functional disability, drawing on combined data from
the 1982, 1984, 1989, and 1994 waves of the survey. The dimensions of
the table correspond to 6 Activities of Daily Living (ADLs), e.g., getting
in/out of bed and using a toilet, and 10 Instrumental Activities of Daily
Living (IADLs), e.g., managing money and taking medicine. In Section
3, we describe some of our results for the combined NLTCS data. We
note that further model extensions are possible to account for the
longitudinal nature of the study, e.g., by employing a powerful conditional
independence assumption to accommodate a longitudinal data structure,
as suggested by Manton et al. (1994).

2. DSM-III-R Psychiatric Classifications. One of the earliest proposals for 
mixed membership models was by Woodbury et al. (1978), in the
context of disease classification. Their model became known as the Grade
of Membership or GoM model, and was later used by Nurnberg et al.
(1999) to study the DSM-III-R typology for psychiatric patients. Their 
analysis involved N = 110 outpatients and used the J = 112 DSM-III-R 
diagnostic criteria for clustering in order to reassess the appropriateness 
of the “official” 12 personality disorders. One could also approach this 
problem as a classical classification problem but with J > N. 



3. Peanut Butter Market Segmentation. Seetharaman et al. (2001) describe
data on peanut butter purchases drawn from A.C. Nielsen’s scanner data- 
base. They work with data from 488 households over 4715 purchase occa- 
sions (chosen such that there are at least 5 per household) for 8 top brands 
of peanut butter. For each choice occasion we have: (a) shelf price, (b) 
information on display/feature promotion, and a set of household char- 
acteristics used to define “market segments” or groupings of households. 
Market segmentation has traditionally been thought of as a standard 
clustering problem but Varki et al. (2000) proposed a mixed-membership 
model for this purpose, which is a variant of the GoM model.

4. Matching Words and Pictures. Blei and Jordan (2003) and Barnard et al. 
(2003) have been doing mixed-membership modeling in machine learning,
combining different sources of information in text documents, i.e.,
main text, photographic images, and image annotations. They estimate
the joint distribution of these characteristics by employing hierarchical
versions of a model known in machine learning as Latent Dirichlet
Allocation. This allows them to perform such tasks as automatic image
annotation (recognizing image regions that portray, for example, clouds,
water, and flowers) and text-based image retrieval (finding unannotated 
images that correspond to a given text query) with remarkably good 
performance. 

5. Race and Population Genetic Structure. In a study of human population
structure, Rosenberg et al. (2002) used genotypes at 377 autosomal
microsatellite loci in 1056 individuals from 52 populations; part of their
analysis focuses on the soft clustering of individuals in groups. One of
the remarkable results of their study, which uses the mixed membership
methods of Pritchard et al. (2002), is a typology structure that is very
close to the “traditional” 5 main racial groups, a notion much maligned
in the recent social science and biological literatures. 

6. Classifying Scientific Publications. Erosheva, Fienberg et al. (2004) and 
Griffiths and Steyvers (2004) have used mixed membership models to
analyse related data bases involving abstracts, text, and references of 
articles drawn from the Proceedings of the National Academy of Sciences 
USA (PNAS). Their mutual goal was to understand the organization 
of scientific publications in PNAS and we explore the similarities and 
differences between their approaches and results later in Section 4. 

What these examples have in common is the mixed membership struc- 
ture. In the following sections, we first introduce our general framework for 
mixed membership models and then we illustrate its application in two of 
the examples, using the PNAS and NLTCS data sets.

2 Mixed membership models

The general mixed membership model relies on four levels of assumptions: 
population, subject, latent variable, and sampling scheme. At the population 
level, we describe the general structure of the population that is common 
to all subjects, while at the subject level we specify the distribution of ob- 
servable responses given individual membership scores. At the latent variable 
level, we declare whether the membership scores are considered fixed or ran- 
dom with some distribution. Finally, at the last level, we specify the number 
of distinct observed characteristics and the number of replications for each 
characteristic. Following the exposition in Erosheva (2002) and Erosheva et 
al. (2004), we describe the assumptions at the four levels in turn. 

Population level. We assume that there are $K$ basis subpopulations (extreme
or pure types) in the population of interest. For each subpopulation $k$, we
denote by $f(x_j \mid \theta_{kj})$ the probability distribution for response variable
$j$, where $\theta_{kj}$ is a vector of parameters. Moreover, we assume that, within a
subpopulation, responses for the observed variables are independent.

Subject level. For each subject, membership vector $\lambda = (\lambda_1, \ldots, \lambda_K)$
represents the degrees of a subject’s membership in each of the subpopulations
or the consonance of the subject with each of the pure types. The form of
the conditional probability, $\Pr(x_j \mid \lambda) = \sum_k \lambda_k f(x_j \mid \theta_{kj})$, combined
with the assumption that the response variables $x_j$ are independent conditional
on membership scores, fully defines the distribution of observed responses $x_j$
for each subject. In addition, given the membership scores, we take the observed
responses from different subjects to be independent.

Latent variable level. We can either assume that the latent variables are 
fixed unknown constants or that they are random realizations from some 
underlying distribution. 

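1. If the membership scores $\lambda$ are fixed but unknown, then

$$\Pr(x_j \mid \lambda; \theta) = \sum_{k=1}^{K} \lambda_k f(x_j \mid \theta_{kj}) \qquad (2)$$

is the conditional probability of observing $x_j$, given the membership
scores $\lambda$ and parameters $\theta$.

2. If the membership scores $\lambda$ are realizations of latent variables from some
distribution $D_\alpha$, parameterized by $\alpha$, then

$$\Pr(x_j \mid \alpha, \theta) = \int \left( \sum_{k=1}^{K} \lambda_k f(x_j \mid \theta_{kj}) \right) dD_\alpha(\lambda) \qquad (3)$$

is the marginal probability of observing $x_j$, given the parameters.

Sampling scheme. Suppose we observe $R$ independent replications of $J$
distinct characteristics for one subject, $\{x_1^{(r)}, \ldots, x_J^{(r)}\}_{r=1}^{R}$. If the
membership scores are realizations from the distribution $D_\alpha$, the conditional
probability is

$$\Pr\left(\{x_1^{(r)}, \ldots, x_J^{(r)}\}_{r=1}^{R} \mid \alpha, \theta\right) = \int \left( \prod_{j=1}^{J} \prod_{r=1}^{R} \sum_{k=1}^{K} \lambda_k f(x_j^{(r)} \mid \theta_{kj}) \right) dD_\alpha(\lambda).$$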

If we treat the latent variables as unknown constants, we get an analogous 
representation for the conditional probability of observing R replications of 
J variables. In general, the number of observed characteristics J need not be 
the same across subjects, and the number of replications R need not be the 
same across observed characteristics. 
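
The sampling-scheme probability can be sketched numerically. The code below is an illustration under stated assumptions, not the authors' estimation procedure: responses are taken to be binary, each $f$ is Bernoulli with success probability `mu[k][j]`, and the integral over the Dirichlet distribution is approximated by plain Monte Carlo. All function names are hypothetical.

```python
import random

def conditional_prob(x, lam, mu):
    """Pr(x | lambda) = prod_j prod_r sum_k lambda_k f(x_j^(r) | mu_kj).

    x   -- x[j] is the list of binary replications of characteristic j
    lam -- membership vector (non-negative, sums to 1)
    mu  -- mu[k][j] is the Bernoulli success probability for profile k, item j
    """
    prob = 1.0
    for j, reps in enumerate(x):
        # Convex combination of the K extreme-profile probabilities for item j.
        p_one = sum(l * mu_k[j] for l, mu_k in zip(lam, mu))
        for obs in reps:
            prob *= p_one if obs == 1 else 1.0 - p_one
    return prob

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) vector by normalising gamma variates."""
    while True:
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(g)
        if total > 0.0:  # guard against all draws underflowing to zero
            return [v / total for v in g]

def marginal_prob(x, alpha, mu, draws=20000, seed=0):
    """Monte Carlo estimate of the integral of Pr(x | lambda) over Dirichlet(alpha)."""
    rng = random.Random(seed)
    total = sum(conditional_prob(x, sample_dirichlet(alpha, rng), mu)
                for _ in range(draws))
    return total / draws
```

Treating the membership scores as fixed corresponds to calling `conditional_prob` directly; treating them as random corresponds to `marginal_prob`. Allowing `x[j]` to have different lengths covers the case where the number of replications differs across characteristics.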

This mixed membership model framework unifies several specialized mod- 
els that have been developed independently in the social sciences, in genetics, 
and in machine learning. Each corresponds to different choices of J and R, 
and different latent variable assumptions. For example, the standard GoM 
model of Woodbury and Clive (1974) and Manton et al. (1994) assumes that 
we observe responses to J survey questions without replications, i.e., R = 1, 
and treats the membership scores as fixed unknown constants (fixed effects).
Examples of “fixed-effect” GoM analyses include, but are not limited to, an
analysis mentioned earlier of DSM-III-R psychiatric classifications in Nurnberg
et al. (1999), a study of data on remote sensing (Talbot (1996)), an analysis
of business opportunities (Talbot et al. (2002)), and a classification of
individual tree crowns into species groups from aerial photographs (Brandtberg
(2002)).

Another class of mixed membership models is based directly on the stan- 
dard GoM model but places a distribution on the membership scores. Thus, 
Potthoff et al. (2000) treat the membership scores as realizations of Dirichlet 
random variables and are able to use marginal maximum likelihood estima- 
tion in a series of classification examples when the number of items J is small. 
Erosheva (2002) provides a Markov chain Monte Carlo estimation scheme for 
the GoM model also assuming the Dirichlet distribution on the membership 
scores. Varki et al. (2000) employ a mixture of point and Dirichlet distribu- 
tions as the generating distribution for the membership scores in their work. 

Independently from the GoM developments, in genetics Pritchard et al. 
(2000) use a clustering model with admixture. For diploid individuals the 
clustering model assumes that R = 2 replications (genotypes) are observed 
at J distinct locations (loci) and that the membership scores are random 
Dirichlet realizations. Again, J and N vary in this and related applications. 
In the Introduction, we briefly described an example of findings obtained 
via this model in the study on race and population genetic structure by 
Rosenberg et al. (2002). 

A standard assumption in machine learning of text and other objects is 
that a single characteristic is observed multiple times. For example, for a
text document of length $L$ only one distinct characteristic, a word, is
observed with $R = L$ realizations. In this set-up, the work of Hofmann (2001)
on probabilistic latent semantic analysis treated membership scores as fixed 
unknown constants and that of Blei et al. (2003) adopted a Dirichlet gen- 
erating distribution for the membership scores. More recently, this line of 
modeling has moved from considering a single characteristic (e.g., words in 
a document) to working with a combination of distinct characteristics. An 
example that we discussed in this area is by Barnard et al. (2003) who mod- 
eled a combination of words and segmented images via a mixed membership 
structure. 

Given this multiplicity of unrelated mixed membership model develop- 
ments, we should not be surprised by the variety of estimation methods 
adopted. Broadly speaking, estimation methods are of two types: those that 
treat membership scores as fixed and those that treat them as random. The 
first group includes the numerical methods introduced by Hofmann (2003) 
and by Kovtun et al. (2004b), and joint maximum likelihood type methods 
described in Manton et al. (1994) and Varki and Cooil (2003) where fixed 
effects for the membership scores are estimated in addition to the popula- 
tion parameter estimates. The statistical properties of the estimates in these 
approaches, such as consistency, identifiability, and uniqueness of solutions, 
are suspect. The second group includes variational estimation methods used 
by Blei et al. (2003), expectation-propagation methods developed by Minka 
and Lafferty (2002), marginal maximum likelihood approaches of Potthoff et al.
(2000) and Varki et al. (2000), and Bayesian MCMC simulations (Pritchard 
et al. (2002), Erosheva (2002, 2003a)). These methods solve some of the sta- 
tistical and computational problems, but many other challenges and open 
questions still remain as we illustrate below. 

3 Disability types among older adults 

3.1 National Long Term Care Survey 

The National Long-Term Care Survey (NLTCS), conducted in 1982, 1984, 
1989, 1994, and 1999, was designed to assess chronic disability in the U.S. 
elderly Medicare-enrolled population (65 years of age or older). Beginning 
with a screening sample in 1982, individuals were followed in later waves 
and additional samples were subsequently added maintaining the sample at 
about 20,000 Medicare enrollees in each wave. The survey aims to provide 
data on the extent and patterns of functional limitations (as measured by
activities of daily living (ADL) and instrumental activities of daily living
(IADL)), availability and details of informal caregiving, use of institutional
care facilities, and death. NLTCS public use data can be obtained from the 
Center for Demographic Studies, Duke University. 

Erosheva (2002) considered the mixed membership model with up to K = 
5 subpopulations or extreme profiles for the 16 ADL/IADL measures, pooled
across four survey waves of NLTCS: 1982, 1984, 1989, and 1994. For each
ADL/IADL measure, individuals can be either disabled or healthy. Thus the
data form a $2^{16}$ contingency table. The table has 65,536 cells, only 3,152 of
which are non-zero, and there are a total of N = 21,574 observations. This
is a large sparse contingency table that is not easily analyzed using classical 
statistical methods such as those associated with log-linear models. 

3.2 Applying the mixed membership model 

Following the GoM structure for dichotomous variables, we have J = 16 
dichotomous characteristics observed for each individual and the number of 
replications $R$ is 1. For each extreme profile $k$, the probability distribution
for characteristic $j$, $f(x_j \mid \theta_{kj})$, is binomial, parameterized by the
probability of the positive response $\mu_{kj}$.

We assume that the membership scores follow a Dirichlet distribution $D_\alpha$
and employ Markov chain Monte Carlo estimation for the latent class
representation of the GoM model (Erosheva (2003a)). We obtain posterior means
for the response probabilities of the extreme profiles and posterior means of
the membership scores conditional on observed response patterns. Estimated
response probabilities of the extreme profiles provide a qualitative description
of the extreme categories of disability as tapped by the 16 ADL/IADL
measures, while the estimated parameters $\alpha$ of the Dirichlet distribution
describe the distribution of the mixed membership scores in the population.

Although the Deviance Information Criterion (Spiegelhalter et al. (2002))
indicates an improvement in fit for $K$ increasing from 2 to 5, with the largest
improvement for $K$ going from 2 to 3, other considerations suggest that a
$K = 4$ solution might be appropriate for this data set (Erosheva (2002)). In
Table 1, we provide posterior means and standard deviation estimates for the
parameters of the GoM model with four extreme profiles. The estimates of
$\xi$ and $\alpha_0$ are reported in Table 1, and their product gives the vector of
Dirichlet distribution parameters. The estimated distribution of the membership
scores is bathtub shaped, indicating that the majority of individual profiles are
close to estimated extreme profiles.
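
A short worked computation may make the "bathtub" remark concrete. The sketch below multiplies the Table 1 posterior means for the proportions and the overall concentration to get a plug-in Dirichlet parameter vector, then uses the standard fact that each coordinate of a Dirichlet vector has a Beta marginal. Using the product of posterior means as a plug-in value is a simplification made for illustration, and the helper names are hypothetical.

```python
# Posterior means copied from Table 1 (K = 4): proportions and concentration.
xi = [0.216, 0.247, 0.265, 0.272]
alpha_0 = 0.197

# Plug-in Dirichlet parameter vector: alpha_k = alpha_0 * xi_k.
alpha = [alpha_0 * x for x in xi]

def marginal_beta_params(alpha, k):
    """Marginal of lambda_k under Dirichlet(alpha) is Beta(alpha_k, sum(alpha) - alpha_k)."""
    total = sum(alpha)
    return alpha[k], total - alpha[k]

def is_u_shaped(a, b):
    """A Beta(a, b) density is U-shaped (mass piled near 0 and 1) when a < 1 and b < 1."""
    return a < 1.0 and b < 1.0

# Check every coordinate's marginal for the U ("bathtub") shape.
all_u_shaped = all(is_u_shaped(*marginal_beta_params(alpha, k))
                   for k in range(len(alpha)))
```

Because every component of the plug-in vector is far below 1, each membership score has a U-shaped marginal, which matches the statement that most individuals sit close to one extreme profile.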

One of the most significant findings in this analysis comes from examining
interpretations of the extreme profiles for the mixed membership models with
$K = 4, 5$: the analysis rejects the hypothesis of a unidimensional disability
structure, i.e., the extreme profiles are qualitatively different and cannot be
ordered by severity. In particular, individuals at two of the estimated extreme
profiles can be described as mostly cognitively and mostly mobility impaired
individuals.
For more details on the analysis and substantive findings see Erosheva (2002). 






Erosheva and Fienberg 



Table 1. Posterior mean (standard deviation) estimates for K = 4 extreme profiles.
The ADL items are: (1) eating, (2) getting in/out of bed, (3) getting around inside,
(4) dressing, (5) bathing, (6) using toilet. The IADL items are: (7) doing heavy
house work, (8) doing light house work, (9) doing laundry, (10) cooking, (11) grocery
shopping, (12) getting about outside, (13) traveling, (14) managing money, (15)
taking medicine, (16) telephoning.

                  k = 1            k = 2            k = 3            k = 4
$\mu_{k,1}$       0.000 (3e-04)    0.002 (2e-03)    0.001 (6e-04)    0.517 (1e-02)
$\mu_{k,2}$       0.000 (3e-04)    0.413 (1e-02)    0.001 (5e-04)    0.909 (7e-03)
$\mu_{k,3}$       0.001 (5e-04)    0.884 (1e-02)    0.018 (8e-03)    0.969 (5e-03)
$\mu_{k,4}$       0.007 (2e-03)    0.101 (6e-03)    0.016 (4e-03)    0.866 (8e-03)
$\mu_{k,5}$       0.064 (4e-03)    0.605 (9e-03)    0.304 (9e-03)    0.998 (2e-03)
$\mu_{k,6}$       0.005 (2e-03)    0.316 (9e-03)    0.018 (4e-03)    0.828 (8e-03)
$\mu_{k,7}$       0.230 (7e-03)    0.846 (7e-03)    0.871 (7e-03)    1.000 (3e-04)
$\mu_{k,8}$       0.000 (2e-04)    0.024 (4e-03)    0.099 (7e-03)    0.924 (7e-03)
$\mu_{k,9}$       0.000 (3e-04)    0.253 (9e-03)    0.388 (1e-02)    0.999 (1e-03)
$\mu_{k,10}$      0.000 (2e-04)    0.029 (5e-03)    0.208 (1e-02)    0.987 (4e-03)
$\mu_{k,11}$      0.000 (3e-04)    0.523 (1e-02)    0.726 (1e-02)    0.998 (2e-03)
$\mu_{k,12}$      0.085 (5e-03)    0.997 (2e-03)    0.458 (1e-02)    0.950 (4e-03)
$\mu_{k,13}$      0.021 (4e-03)    0.585 (1e-02)    0.748 (1e-02)    0.902 (5e-03)
$\mu_{k,14}$      0.001 (7e-04)    0.050 (5e-03)    0.308 (1e-02)    0.713 (8e-03)
$\mu_{k,15}$      0.013 (2e-03)    0.039 (4e-03)    0.185 (8e-03)    0.750 (8e-03)
$\mu_{k,16}$      0.014 (2e-03)    0.005 (2e-03)    0.134 (7e-03)    0.530 (9e-03)
$\xi_k$           0.216 (2e-02)    0.247 (2e-02)    0.265 (2e-02)    0.272 (2e-02)
$\alpha_0$        0.197 (5e-03)

















4 Classifying publications by topic 

4.1 Proceedings of the National Academy of Sciences 

The Proceedings of the National Academy of Sciences (PNAS) is the world’s 
most cited multidisciplinary scientific journal. Historically, when submitting 
a research paper to the Proceedings, authors have to select a major category 
from Physical, Biological, or Social Sciences, and a minor category from the 
list of topics. PNAS permits dual classifications between major categories 
and, in exceptional cases, within a major category. The lists of topics change 
over time in part to reflect changes in the National Academy sections. Since 
in the nineties the vast majority of the PNAS research papers were in the
Biological Sciences, our analysis focuses on this subset of publications. Another
reason for limiting ourselves to one major category is that we expect papers 
from different major categories to have a limited overlap. 

In the Biological Sciences there are 19 topics. Table 2 gives the percentages 
of published papers for 1997-2001 (Volumes 94-98) by topic and the numbers
of dual classification papers in each topic.

Table 2. Biological Sciences publications in PNAS volumes 94-98, by subtopic,
and numbers of papers with dual classifications. The numbers in the final column
represent projections based on our model.

     Topic                         Number        %    Dual   % Dual   More Dual?
 1   Biochemistry                    2578   21.517      33   18.436          338
 2   Medical Sciences                1547   12.912      13    7.263           84
 3   Neurobiology                    1343   11.209       9    5.028          128
 4   Cell Biology                    1231   10.275      10    5.587          111
 5   Genetics                         980    8.180      14    7.821          131
 6   Immunology                       865    7.220       9    5.028           39
 7   Biophysics                       636    5.308      40   22.346           62
 8   Evolution                        510    4.257      12    6.704          167
 9   Microbiology                     498    4.157      11    6.145           42
10   Plant Biology                    488    4.073       4    2.235           54
11   Developmental Biology            366    3.055       2    1.117           43
12   Physiology                       340    2.838       1    0.559           34
13   Pharmacology                     188    1.569       2    1.117           34
14   Ecology                          133    1.110       5    2.793           16
15   Applied Biological Sciences       94    0.785       6    3.352            7
16   Psychology                        88    0.734       1    0.559           22
17   Agricultural Sciences             43    0.359       2    1.117            8
18   Population Biology                43    0.359       5    2.793            4
19   Anthropology                      10    0.083       0    0                2
     Total                          11981  100         179  100             1319



4.2 Applying the mixed membership model 

The topic labels provide an author-designated classification structure for pub- 
lished materials. Notice that the vast majority of the articles are members 
of only a single topic. We represent each article by collections of words in 
the abstract and references in the bibliography. For our mixed membership 
model, we assume that there is a fixed number of extreme categories or as- 
pects, each of which is characterized by multinomial distributions over words 
(in abstracts) and references (in bibliographies). A distribution of words and
references in each article is given by the convex combination of the aspects’ 
multinomials weighted by proportions of the article’s content coming from 
each category. These proportions, or membership scores, determine soft clus- 
tering of articles with respect to the internal categories. 
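
The convex combination described above can be sketched in a few lines. The vocabulary size, the two aspects, and all probabilities below are made-up numbers for illustration, and `article_distribution` is a hypothetical helper rather than part of any published implementation.

```python
def article_distribution(membership, aspect_probs):
    """Mix aspect-level multinomials into one word distribution for an article.

    membership   -- membership scores lambda_1..lambda_K (sum to 1)
    aspect_probs -- aspect_probs[k][w] is Pr(word w | aspect k)
    """
    vocab_size = len(aspect_probs[0])
    return [sum(lam * probs[w] for lam, probs in zip(membership, aspect_probs))
            for w in range(vocab_size)]

# Toy vocabulary of three words and K = 2 aspects (illustrative values only).
aspects = [[0.7, 0.2, 0.1],
           [0.1, 0.3, 0.6]]

# An article drawing 75% of its content from aspect 1 and 25% from aspect 2.
mixed = article_distribution([0.75, 0.25], aspects)
```

The same construction would apply to references, with a second set of aspect-level multinomials sharing the article's membership scores.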

Choosing a suitable value for the number of internal categories or aspects, 
K , in this type of setting is difficult. We have focused largely on two versions 
of the model, one with eight aspects and the other with ten. The set of para- 
meters in our model is given by multinomial word and reference probabilities 
for each aspect, and by the parameters of the Dirichlet distribution, which is
the generating distribution for membership scores. There are 39,616 unique words
and 77,115 unique references in our data, hence adding an aspect corresponds
to having 39,615 + 77,114 + 1 = 116,730 additional parameters. Because of
the large numbers of parameters involved, it is difficult to assess the extent to
which the added pair of aspects actually improves the fit of the model to the
data. In a set of preliminary comparisons we found little to choose between
the two models in terms of fit, and greater ease of interpretation for the
eight-aspect model. In Erosheva et al. (2004) we report on the details of the
analysis of the $K = 8$ aspect model and its interpretation, and we retain that
focus here.
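
The parameter count quoted above follows from simple arithmetic, reproduced here as a sketch. The function name is hypothetical, and the totals for eight and ten aspects are derived by multiplication rather than taken from the paper.

```python
def params_per_aspect(n_words, n_refs):
    """Free parameters added by one aspect.

    A multinomial over V categories has V - 1 free probabilities; each aspect
    contributes one multinomial over words, one over references, and one
    Dirichlet parameter.
    """
    return (n_words - 1) + (n_refs - 1) + 1

per_aspect = params_per_aspect(39_616, 77_115)  # 39,615 + 77,114 + 1
total_k8 = 8 * per_aspect    # parameters in the eight-aspect model
total_k10 = 10 * per_aspect  # parameters in the ten-aspect model
```

With roughly a million parameters in either model and about twelve thousand articles, it is unsurprising that fit comparisons between the two are hard to interpret.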

From our analysis of high probability words and references, the 8 aspects 
of our model have the following interpretation: 

1. Intracellular signal transduction, neurobiology.

2. Evolution, molecular evolution. 

3. Plant molecular biology. 

4. Developmental biology; brain development. 

5. Biochemistry, molecular biology; protein structural biology. 

6. Genetics, molecular biology; DNA repair, mutagenesis, cell cycle. 

7. Tumor immunology; HIV infection. 

8. Endocrinology, reporting of experimental results; molecular mechanisms 
of obesity. 

Based on the interpretations, it is difficult to see whether the majority of
aspects correspond to a single topic from the official PNAS classifications. To
investigate a correspondence between the estimated aspects and the given
topics further, we examine aspect “loadings” for each paper. Given estimated
parameters of the model, the distribution of each article’s “loadings” can be
obtained via Bayes’ theorem. The variational and expectation-propagation
procedures give Dirichlet approximations to the posterior distribution
$p(\lambda(d), \theta)$ for each document $d$. We employ the mean of this Dirichlet as an
estimate of the “weight” of the document on each aspect.

We can gauge the sparsity of the loadings by the parameters of the Dirichlet
distribution, which for the $K = 8$ model we estimate as $\alpha_1 = 0.0195$,
$\alpha_2 = 0.0203$, $\alpha_3 = 0.0569$, $\alpha_4 = 0.0346$, $\alpha_5 = 0.0317$,
$\alpha_6 = 0.0363$, $\alpha_7 = 0.0411$, $\alpha_8 = 0.0255$. This estimated Dirichlet,
which is the generative distribution of membership scores, is “bathtub shaped”
on the simplex; as a result, articles will tend to have relatively high
membership scores in only a few aspects.
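
One way to see what such small Dirichlet parameters imply is to simulate membership vectors and count how often a single aspect dominates. The sketch below does this with the standard library only; the 0.9 threshold, the number of draws, the seed, and the function names are arbitrary choices for illustration.

```python
import random

# Estimated Dirichlet parameters for the K = 8 model, as quoted in the text.
ALPHA = [0.0195, 0.0203, 0.0569, 0.0346, 0.0317, 0.0363, 0.0411, 0.0255]

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) vector by normalising gamma variates."""
    while True:
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(g)
        if total > 0.0:  # guard against all draws underflowing to zero
            return [v / total for v in g]

def share_dominated(alpha, threshold=0.9, draws=5000, seed=1):
    """Fraction of simulated membership vectors whose largest score exceeds threshold."""
    rng = random.Random(seed)
    hits = sum(max(sample_dirichlet(alpha, rng)) > threshold for _ in range(draws))
    return hits / draws

sparse_share = share_dominated(ALPHA)       # parameters well below 1
flat_share = share_dominated([1.0] * 8)     # uniform on the simplex, for contrast
```

Comparing `sparse_share` with `flat_share` shows the contrast between the estimated distribution and a uniform distribution on the simplex, under which a single aspect almost never carries 90% of the weight.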

To summarize the aspect distributions for each topic, we provide a graph- 
ical representation of these values for K = 8 and K = 10 in Figure 1 and 
Figure 2, respectively. Examining the rows of Figure 1, we see that, with the 
exception of Evolution and Immunology, the subtopics in Biological Sciences 
are concentrated on more than one internal category. The column decomposi- 
tion, in turn, can assist us in interpreting the aspects. Aspect 8, for example, 
which from the high probability words seems to be associated with the re- 
porting of experimental results, is the aspect of origin for a combined 37% 
of Physiology, 30% of Pharmacology, and 25% of Medical Sciences papers, 
according to the mixed membership model.

Fig. 1. Graphical representation of mean decompositions of aspect membership
scores for K = 8. Source: Erosheva et al. (2004).




Fig. 2. Graphical representation of mean decompositions of aspect membership 
scores for K = 10. 



Finally, we compare the loadings (posterior means of the membership 
scores) of dual-classified articles to those that are singly classified. We
consider two articles as having similar membership vectors if their loadings
agree in the first significant digit for all aspects. One might consider singly
classified articles that have membership vectors similar to those of
dual-classified articles as interdisciplinary, i.e., articles that should have had
dual classification but did not. We find that, for 11 percent of the singly
classified articles, there is at least one dual-classified article that has similar
membership scores. For example, three biophysics dual-classified articles with
loadings 0.9 for the second and 0.1 for the third aspect turned out to have
similar loadings to 86 singly classified articles from biophysics, biochemistry,
cell biology, developmental biology, evolution, genetics, immunology, medical 
sciences, and microbiology. In the last column of Table 2, we give the numbers 
of projected additional dual classification papers by PNAS topic. 
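
The matching rule can be sketched as follows. The text does not spell out how "first significant digit" is computed, so this sketch assumes it means rounding each loading to one decimal place, which is consistent with the 0.9 and 0.1 example; the function names and toy loadings are hypothetical.

```python
def rounded_signature(loadings, digits=1):
    """Round each loading to `digits` decimals (assumed reading of the matching rule)."""
    return tuple(round(v, digits) for v in loadings)

def singly_with_dual_match(single, dual, digits=1):
    """Indices of singly classified articles whose rounded loadings coincide
    with those of at least one dual-classified article."""
    dual_signatures = {rounded_signature(v, digits) for v in dual}
    return [i for i, v in enumerate(single)
            if rounded_signature(v, digits) in dual_signatures]

# Toy loadings over three aspects (made-up values).
dual_articles = [[0.0, 0.9, 0.1]]
single_articles = [[0.02, 0.88, 0.10],   # rounds to (0.0, 0.9, 0.1): a match
                   [0.50, 0.40, 0.10]]   # no dual-classified counterpart
candidates = singly_with_dual_match(single_articles, dual_articles)
```

Articles flagged this way would be the candidates for an additional classification; the counts in the last column of Table 2 come from the authors' analysis, not from this sketch.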

4.3 An alternative approach with related data 

Griffiths and Steyvers (2004) use a related version of the mixed membership 
model on the words in PNAS abstracts for the years 1991-2001, involving 
28,154 abstracts. Their corpus involves 20,551 words that occur in at least five 
abstracts, and are not on the “stop list” . Their version of the model does not 
involve the full hierarchical probability structure. In particular, they employ a
Dirichlet($\alpha$) distribution for membership scores $\lambda$, but they fix $\alpha$ at
$50/K$, and a Dirichlet($\beta$) distribution for aspect word probabilities, but
they fix $\beta$ at 0.1. These choices lead to considerable computational
simplification that
allows using a Gibbs sampler for the Monte Carlo computation of marginal 
components of the posterior distribution. 

Griffiths and Steyvers (2004) report estimates of $\Pr(\text{data} \mid K)$
for $K$ = 50, 100, 200, 300, 400, 500, 600, 1000, integrating out the latent
variable values. They then pick $K$ to maximize this probability. This is referred
to in the literature as a maximum a posteriori (MAP) estimate (e.g., see 
Denison et al. (2002)), and it produces a value of K approximately equal to 
300, more than an order of magnitude greater than our value of K = 8. 

There are many obvious and some more subtle differences between our 
data and those analyzed by Griffiths and Steyvers as well as between our 
approaches. Their approach differs from ours because of the use of a words- 
only model, as well as through the simplification involving the fixing of the 
Dirichlet parameters and through a more formal selection of dimensionality. 
While we cannot claim that a rigorous model selection procedure would
estimate the number of internal categories close to 8, we believe that a high
number such as $K = 300$ is at least in part an artifact of the data and
analytic choices made by Griffiths and Steyvers. For example, we expect
that using the class of Dirichlet distributions with parameters $50/K$ when
$K > 50$ for membership scores biases the results towards favoring many more
categories than there are in the data, due to increasingly strong preferences
towards extreme membership scores with increasing $K$. Moreover, the use
of the MAP estimate of $K$ has buried within it an untenable assumption,
namely that $\Pr(K)$ is constant a priori, and pays no penalty for an excessively
large number of aspects.
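
The effect of fixing the Dirichlet parameter at 50/K can be read off directly from the candidate values of K listed earlier. The sketch below only tabulates the per-aspect parameter and checks whether it falls below 1, the point at which a symmetric Dirichlet starts to favour membership vectors near the boundary of the simplex; the function names are hypothetical.

```python
# Candidate numbers of aspects considered by Griffiths and Steyvers (2004).
K_VALUES = [50, 100, 200, 300, 400, 500, 600, 1000]

def symmetric_alpha(k, total=50.0):
    """Per-aspect Dirichlet parameter when the total concentration is held at `total`."""
    return total / k

def favours_extremes(k, total=50.0):
    """A symmetric Dirichlet piles mass near the simplex boundary when alpha < 1."""
    return symmetric_alpha(k, total) < 1.0

alphas = {k: symmetric_alpha(k) for k in K_VALUES}
```

For every candidate above 50 the per-aspect parameter is below 1 and shrinks as K grows, which is the mechanism behind the concern that the prior itself rewards large K.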

4.4 Choosing K to describe PNAS topics 

Although the analyses in the two preceding subsections share the same general
goal, i.e., detecting the underlying structure of PNAS research publications,
they emphasize two different levels of focus. For the analysis of words
and references in Erosheva et al. (2004), we aimed to provide a succinct high- 
level summary of the population of research articles. This led us to narrow 
our focus to research reports in biology and to keep the numbers of topics 
within the range of the current classification scheme. We found the results 
for K = 8 aspects were more easily interpretable than those for K = 10 but
because of time and computational expense we did not explore more fully the 
choice of K. 

For their word-only model, Griffiths and Steyvers (2004) selected the 
model based on K = 300, which seems to be aimed more at the level of
prediction, e.g., obtaining the most detailed description possible for each
paper. They worked with a database of all PNAS publications for given
years and considered no penalty for using a large number of aspects such 
as that which would be associated with the Bayesian Information Criterion 
applied to the marginal distributions integrating out the latent variables. 
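
For reference, the kind of penalty meant here can be written in the usual Schwarz form. This is a sketch in notation assumed for illustration, not a formula taken from Griffiths and Steyvers (2004): $\hat{\theta}_K$ stands for the fitted parameters of the model with K aspects, $d_K$ for its number of free parameters, and $n$ for the number of observations.

```latex
\mathrm{BIC}(K) = -2 \log \Pr(\text{data} \mid \hat{\theta}_K, K) + d_K \log n
```

Because $d_K$ grows with K, a larger K is preferred under such a criterion only if the improvement in fit outweighs the added penalty.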

Organizing aspects hierarchically, with sub-aspects having mixed mem- 
bership in aspects, might allow us to reconcile our higher level topic choices 
with their more fine-grained approach. 

5 Summary and concluding remarks 

In this paper we have described a Bayesian approach to a general mixed 
membership model that allows for: 

• Identification of internal clustering categories (unsupervised learning) . 

• Soft or mixed clustering and classifications. 

• Combination of types of characteristics, e.g., numerical values and cate- 
gories, words and references for documents, features from images. 

The ideas behind the general model are simple but they allow us to view 
seemingly disparate developments in soft clustering or classification prob- 
lems in diverse fields of application within the same broad framework. This 
unification has at least two salutary implications:

• Developments and computational methods from one domain can be im- 
ported to or shared with another. 

• New applications can build on the diverse developments and utilize the 
general framework instead of beginning from scratch. 

When the GoM model was first developed, there were a variety of im- 
pediments to its implementation with large datasets, but the most notable 
were technical issues of model identifiability and consistency of estimation, 
since the number of parameters in the model for even a modest number of 
groups (facets) is typically greater than the number of observations, as well as 
possible multi-modal likelihood functions even when the model was properly 
identified. These technical issues led to practical computational problems
and concerns about the convergence of algorithms. The Bayesian hierarchical
formulation described here allows for solutions to a number of these diffi- 
culties, even in high dimensions, as long as we are willing to make some 
simplifying assumptions and approximations. Many challenges remain, both 
statistical and computational. These include computational approaches to 
full posterior calculations, model selection (i.e., choosing K), and the
development of extensions of the model to allow for both hierarchically structured
latent categories and dependencies associated with longitudinal structure.

Abstract. The primary structure of a protein is the sequence of its amino acids.
The secondary structure describes structural properties of the molecule such as
which parts of it form sheets, helices or coils. Spatial and other properties are
described by the higher order structures. The classification task we are considering
here is to predict the secondary structure from the primary one. To this end we
train a Markov model on training data and then use it to classify parts of unknown 
protein sequences as sheets, helices or coils. We show how to exploit the directional 
information contained in the Markov model for this task. Classifications that are 
purely based on statistical models might not always be biologically meaningful. We 
present combinatorial methods to incorporate biological background knowledge to 
enhance the prediction performance. 



1 Introduction 

The primary structure of a protein is given by the sequence of its
amino acids. The secondary structure is a classification of contiguous stretches
of a protein molecule according to their conformation. We use a threefold
classification, namely the conformations helix, sheet, and coil. Most databases
contain a finer classification into 6 or more classes. We use the mapping
from Garnier et al. (1996) and Kloczkowski et al. (2002) to reduce to the three
aforementioned classes.

The task is to determine the secondary structure from the primary one. 
We use a supervised learning approach for this purpose. From a database one
collects a number of protein sequences for which the classifications are known.
On these a (statistical) model is trained and then used to assign classifications
to new, unclassified protein sequences. There are a number of such classifiers,
which are based on different learning concepts. Some use statistical methods,
e.g., the GOR algorithm (Garnier et al. (1996) and Kloczkowski et al.
(2002)). GOR are the initials of the authors of the first version of this method:
Garnier, Osguthorpe, and Robson. Other algorithms rely on neural networks,
like PHD (Rost and Sander (1993) and (1994)). The acronym means “Profile
network from HD”, where HD is the number plate code for Heidelberg,
Germany, where the authors worked. Most of them incorporate biological
background knowledge at some stage. For example, a first classification given
by a statistical model is then checked for biological plausibility and, if
necessary, corrected.

We use a first order Markov model as classifier. This type of classifier 
has been successfully used before in a related setting (Brunnert et al. (2002)).
There, the order and length of the helix and sheet subsequences were given
(but no information on the intermediate coil parts was known). Here, we 
investigate how this classifier performs without the additional information 
on order and length and how its performance can be improved. The aim is 
to push the basic statistical method to its limits before combining it with 
other techniques. We use the GOR algorithms as references. They have been 
re-implemented without the incorporation of background knowledge. 



2 The method 

Let $\Sigma_a$ denote the alphabet for the 20 amino acids, and let $\Sigma_c = \{H, E, C\}$
denote the classification alphabet, where H denotes helix, E sheet, and C coil.
In the following let $x = x_1, \dots, x_n$ be a protein sequence, where $x_i \in \Sigma_a$. Let
$\|x\|$ denote its length. Let $c = c_1, \dots, c_n$ be the corresponding classification
sequence, $c_i \in \Sigma_c$.

We shall use a first order Markov model for the prediction. The model
uses a parameter $\ell$, the window size. Such a model assigns probabilities $p$ to
subsequences of $x$ of length $\ell$ as follows:

$$p(x_i, \dots, x_{i+\ell-1}) = p(x_i)\, p(x_{i+1} \mid x_i) \cdots p(x_{i+\ell-1} \mid x_{i+\ell-2}) \qquad (1)$$

For the threefold classification task we have in mind, three such models are
used, one for each of the classes {H, E, C}. The probability functions of the
respective models are denoted by $p_H$, $p_E$, and $p_C$. The three models are
trained by estimating their parameters of the kinds $p_X(\cdot)$ and $p_X(\cdot \mid \cdot)$,
$X \in \{H, E, C\}$. Then they can be used for classification of new sequences as
follows: One evaluates all three models and then assigns the class that
corresponds to the model with the highest probability: $\arg\max \{p_H, p_E, p_C\}$. The
obvious problem with this approach is that a Markov model assigns probabilities
to subsequences (windows) and not to individual amino acids. This might lead
to conflicting predictions. If, for example, $E = \arg\max_X \{p_X(x_1, \dots, x_{\ell-1})\}$
and $H = \arg\max_X \{p_X(x_2, \dots, x_\ell)\}$, it is not clear which of the two
classifications $x_2$ should get. We choose to assign the classification of a window to
the first amino acid in that window. The estimation of the model parameters
is then performed to support this choice. We write $p_X(i)$ for the resulting
score of position $i$, i.e.,

$$p_X(i) = p_X(x_i)\, p_X(x_{i+1} \mid x_i) \cdots p_X(x_{i+\ell-1} \mid x_{i+\ell-2}) \qquad (2)$$
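
The window scoring and the assignment of a window's class to its first amino acid can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used for the experiments: the function names, the dictionary layout (`init[a]` for $p_X(a)$, `trans[a][b]` for $p_X(b \mid a)$), the use of log probabilities (which leaves the argmax unchanged), and the two-letter toy alphabet are all hypothetical.

```python
import math


def window_log_score(seq, i, ell, init, trans):
    """Log of p_X(i): first-order Markov score of the window seq[i : i + ell]."""
    score = math.log(init[seq[i]])
    for k in range(i + 1, i + ell):
        # trans[a][b] holds p_X(b | a), the probability of b following a.
        score += math.log(trans[seq[k - 1]][seq[k]])
    return score


def classify(seq, ell, models):
    """Label each window start i with the class whose model scores the window highest."""
    labels = []
    # The last ell - 1 positions do not start a full window and stay unclassified.
    for i in range(len(seq) - ell + 1):
        best = max(models, key=lambda X: window_log_score(seq, i, ell, *models[X]))
        labels.append(best)
    return "".join(labels)


# Toy models over a two-letter alphabet: H favors "A", E favors "B", C is uniform.
def constant_rows(p_a):
    row = {"A": p_a, "B": 1.0 - p_a}
    return row, {"A": dict(row), "B": dict(row)}


toy_models = {"H": constant_rows(0.9), "E": constant_rows(0.1), "C": constant_rows(0.5)}
result = classify("AAABBB", 2, toy_models)
print(result)  # HHCEE
```

With window size 2 the six-letter toy sequence yields five labels; the mixed window "AB" is best explained by the uniform model and is therefore labelled C.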

We investigated several modifications of the Markov model, some of which 
also differ in the training process. The basic training is conducted as follows. 



Predicting Protein Secondary Structure with Markov Models 






The training data consists of N protein sequences $x^{(j)}$ and the corresponding
classification sequences $c^{(j)}$, $j = 1, \dots, N$. Now, three sets of subsequences
are constructed, one for each of the three classes. Each sequence $x^{(j)}$
is divided into maximal substrings according to the classification $c^{(j)}$: Let
$x^{(j)}_k x^{(j)}_{k+1} \cdots x^{(j)}_{k+\ell-1}$ be such a subsequence. Then
$c^{(j)}_k = c^{(j)}_{k+1} = \cdots = c^{(j)}_{k+\ell-1}$, and either $k = 1$ or
$c^{(j)}_{k-1} \neq c^{(j)}_k$, and either $k + \ell - 1 = \|x^{(j)}\|$ or
$c^{(j)}_{k+\ell-1} \neq c^{(j)}_{k+\ell}$.
When we use the term subsequence in the following we mean such a maximal
subsequence. We denote the three collections of subsequences by $\mathcal{S}_H$, $\mathcal{S}_E$, and
$\mathcal{S}_C$. On each of these sets a Markov model is trained by estimating its
parameters. Let $M_H$, $M_E$, and $M_C$ be the respective models. The estimations
are the relative frequencies of (pairs of) residues in the training data. To
avoid zero empirical probabilities, we introduce a pseudocount value $c \geq 0$,
where $c = 0$ is the estimation without pseudocounts. Let $X \in \{H, E, C\}$ be
the class and let $y^{(j)} \in \mathcal{S}_X$ denote the maximal subsequences. Then the estimations
are

$$p_X(a) = \frac{c + \left|\{\, j \mid y^{(j)} \in \mathcal{S}_X \wedge y^{(j)}_1 = a \,\}\right|}{|\Sigma_a|\, c + |\mathcal{S}_X|} \qquad (3)$$

$$p_X(b \mid a) = \frac{c + \left|\{\, (i,j) \mid y^{(j)} \in \mathcal{S}_X \wedge 1 \leq i < \|y^{(j)}\| \wedge y^{(j)}_i = a \wedge y^{(j)}_{i+1} = b \,\}\right|}{|\Sigma_a|\, c + \left|\{\, (i,j) \mid y^{(j)} \in \mathcal{S}_X \wedge 1 \leq i < \|y^{(j)}\| \wedge y^{(j)}_i = a \,\}\right|} \qquad (4)$$
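
The training step can be sketched in a few lines. The code below is a minimal illustration of Equations (3) and (4), assuming sequences and classifications are given as equally long strings; the function names and the toy data are hypothetical, and a pseudocount c > 0 is assumed so that no denominator can be zero.

```python
from collections import Counter
from itertools import groupby


def maximal_segments(seq, labels):
    """Split seq into maximal runs of constant class label; return {class: [segments]}."""
    segments = {}
    pos = 0
    for cls, run in groupby(labels):
        length = len(list(run))
        segments.setdefault(cls, []).append(seq[pos:pos + length])
        pos += length
    return segments


def estimate(segments, alphabet, c):
    """Pseudocount estimates of p_X(a), Eq. (3), and p_X(b | a), Eq. (4), for one class."""
    # Eq. (3): count how often each letter starts a maximal subsequence.
    first = Counter(seg[0] for seg in segments)
    # Eq. (4): count adjacent letter pairs inside the subsequences.
    pairs = Counter((seg[i], seg[i + 1]) for seg in segments for i in range(len(seg) - 1))
    a_total = Counter()
    for (a, _b), n in pairs.items():
        a_total[a] += n
    size = len(alphabet)
    init = {a: (c + first[a]) / (size * c + len(segments)) for a in alphabet}
    trans = {a: {b: (c + pairs[(a, b)]) / (size * c + a_total[a]) for b in alphabet}
             for a in alphabet}
    return init, trans


# Toy example over a two-letter alphabet.
by_class = maximal_segments("AABBA", "HHEEH")
init_h, trans_h = estimate(by_class["H"], "AB", c=1.0)
```

For a full training set, the segments of all N sequences would be pooled per class before calling `estimate` once for each of H, E, and C.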



3 Improvements 

The basic classification method described above has been analysed and mod- 
ified in order to detect the importance of the various parameters and to 
improve its performance. The tests were carried out while maintaining the 
statistical nature of the approach. No biological background knowledge was 
incorporated. Also, the method was not combined with other techniques. The 
aim was to push the performance of the basic method as far as possible before 
applying other techniques. In the following we describe the modifications and 
their influence on the performance. 

The results shown here come from tests performed in Larsen and Thomsen
(2004) on the GOR data set (Garnier et al. (1996)), which consists of 267
protein sequences. The method was evaluated using leave-one-out cross-validation.
We also used the benchmark data set of 513 protein sequences. The results
on the latter set showed no relevant difference from those on the GOR data set.
Due to the structure of the Markov model with window size $\ell$, the last $\ell - 1$
residues of a sequence cannot be classified. The percent figures thus are the
ratios of correctly classified residues to all classified residues.
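
The evaluation protocol can be sketched as follows. This is a generic illustration, assuming hypothetical `train` and `predict` callables and string-encoded classifications; it is not the code used in Larsen and Thomsen (2004). Positions that the predictor leaves unclassified at the end of a sequence are simply not counted.

```python
from collections import Counter


def leave_one_out_rate(sequences, labelings, train, predict):
    """Classification rate: correctly classified residues over all classified residues."""
    correct = 0
    total = 0
    for held_out in range(len(sequences)):
        # Train on every sequence except the held-out one.
        train_x = sequences[:held_out] + sequences[held_out + 1:]
        train_y = labelings[:held_out] + labelings[held_out + 1:]
        model = train(train_x, train_y)
        predicted = predict(model, sequences[held_out])
        # zip stops at the shorter string, so unclassified trailing residues are skipped.
        for p, t in zip(predicted, labelings[held_out]):
            correct += int(p == t)
            total += 1
    return correct / total


# Toy baseline: always predict the most frequent class seen in training.
def train_majority(train_x, train_y):
    return Counter("".join(train_y)).most_common(1)[0][0]


def predict_majority(model, seq):
    return model * len(seq)


rate = leave_one_out_rate(["AAAA", "BBBB", "ABAB"], ["CCCC", "CCHH", "CCCH"],
                          train_majority, predict_majority)
print(rate)  # 0.75
```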

Pseudocount and window size: These two parameters have been varied
independently. The window size parameter $\ell$ is the number of terms used
in the Markov expansion (1). The range for the window size was 1 through 10.
One would expect that a very small window size results in bad performance,
because too little information is used in the classification process. Also, very
large window sizes should decrease the performance because the local
information is blurred by far-off data. The pseudocount parameter c was varied
from 0 through 1000. The effect of this parameter depends on the size of the
training set. In our case the set was so large that no zero empirical
probabilities occurred. Nevertheless, the performance of the classifier was improved
when using small positive pseudocount values. We believe that this is because
statistical fluctuations in small (imprecisely estimated) empirical
probabilities are leveled out by the pseudocounts.

The optimal choice of the parameters was a window size of 5 and a pseudocount
value of 5. These settings were used in all following results. We also
varied the window size and pseudocount constant in combination with other
modifications, but the aforementioned values stayed optimal. Figure 1 shows
a plot of the test results. With this choice, the basic model has a classification
rate (share of correctly classified residues) of 51.0%. The naive classification,
constantly predicting the most frequent class (coil), would give
43%.



Fig. 1. A contour plot of the prediction performance of the basic Markov model as a
function of the window size and the pseudocount constant. The vertical axis is
from S1 = 0 to S7 = 15.



Estimation of $p_X(a)$: In Equation (3) the parameter $p_X(a)$ was estimated
as the empirical frequency of a as the first letter of a maximal subsequence
with classification X. This definition stems from the application in
Brunnert et al. (2002), where additional knowledge on the structure of the
subsequences (length/order) was available. We changed the estimation (3) to

$$p_X(a) = \frac{c + \left|\{\, (i,j) \mid y^{(j)} \in \mathcal{S}_X \wedge y^{(j)}_i = a \,\}\right|}{|\Sigma_a|\, c + \sum_j \|y^{(j)}\|} \qquad (5)$$

the frequency of the letter a in all subsequences with classification X. Using
this definition improved the classification performance by 1.4 percentage
points. The increase was expected, because more information on residues in
the middle of the subsequences is used.
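
A minimal sketch of the modified estimate (5), assuming the maximal subsequences of one class are given as a list of strings; the function name and the toy data are hypothetical.

```python
from collections import Counter


def letter_frequencies(segments, alphabet, c):
    """Eq. (5): pseudocount frequency of each letter over all positions of all segments."""
    counts = Counter(ch for seg in segments for ch in seg)
    total_length = sum(len(seg) for seg in segments)
    return {a: (c + counts[a]) / (len(alphabet) * c + total_length) for a in alphabet}


freq = letter_frequencies(["AAB", "A"], "AB", c=1.0)
```

Compared with Equation (3), every position of a subsequence contributes to the count, not only its first letter.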

Estimation of $p(a \mid b)$: Instead of using only Equation (4) to estimate the
conditional probabilities, we also considered the reversed sequence. That is,
we computed $p_{forw}(a \mid b)$ as in Equation (4) and $p_{rev}(a \mid b)$ as in Equation (4)
but on the reversed sequence. Then we set $p(a \mid b)$ to the sum of $p_{forw}(a \mid b)$ and
$p_{rev}(a \mid b)$ and normalized to get a probability distribution. Using this definition
of $p(a \mid b)$ improved the prediction performance by 2 percentage points.
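
One possible reading of this modification is sketched below. It is an assumption that the normalization is carried out separately for each conditioning letter, so that every row of the combined table sums to one; the text does not spell this out. Function names and toy data are hypothetical, and `trans[a][b]` denotes the probability of b following a.

```python
from collections import Counter


def transition_estimates(segments, alphabet, c):
    """Eq. (4): pseudocount estimate of the probability that b follows a."""
    pairs = Counter((seg[i], seg[i + 1]) for seg in segments for i in range(len(seg) - 1))
    totals = Counter()
    for (a, _b), n in pairs.items():
        totals[a] += n
    size = len(alphabet)
    return {a: {b: (c + pairs[(a, b)]) / (size * c + totals[a]) for b in alphabet}
            for a in alphabet}


def bidirectional_transitions(segments, alphabet, c):
    """Sum the forward and reversed-sequence estimates, then renormalize each row."""
    forw = transition_estimates(segments, alphabet, c)
    rev = transition_estimates([seg[::-1] for seg in segments], alphabet, c)
    combined = {}
    for a in alphabet:
        row = {b: forw[a][b] + rev[a][b] for b in alphabet}
        norm = sum(row.values())
        combined[a] = {b: value / norm for b, value in row.items()}
    return combined


combined = bidirectional_transitions(["AAB"], "AB", c=1.0)
```

Since both input tables already have rows summing to one, the renormalization here amounts to averaging the forward and reverse estimates.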

Direction: Markov models exploit directional information. We therefore 
tried another modification, namely to reverse the sequences in the train- 
ing and the classification process. We did not expect a significant increase 
from this. To our surprise the classification performance was increased by 
1.5 percentage points when using reversed sequences. This indicates that the 
sequence data is more informative in one direction than in the other one. 

Momentum: This variation of the basic method tries to achieve a more
“stable” classification as the classification window moves along the protein
sequence. To this end we consider the discounted values of previous
classifications. The new classification value, denoted by $p'_X(i)$, replaces the original
value $p_X(i)$ from (2) and is defined by

$$p'_X(1) = p_X(1)$$

$$p'_X(i) = w \cdot p'_X(i-1) + (1 - w) \cdot p_X(i)$$

To determine a good value for the discount constant w, different settings of
$w \in [0, 1]$ were tested. The choice of w = 0.5 showed the best results, with an
increase of 4.3 percentage points in the prediction performance.
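
The recursion can be sketched as follows, assuming the window scores $p_X(i)$ of each class are available as a list; the function names and the toy scores are hypothetical.

```python
def momentum_scores(scores, w=0.5):
    """Apply p'(1) = p(1) and p'(i) = w * p'(i - 1) + (1 - w) * p(i) to one class."""
    smoothed = []
    for i, p in enumerate(scores):
        if i == 0:
            smoothed.append(p)
        else:
            smoothed.append(w * smoothed[-1] + (1 - w) * p)
    return smoothed


def classify_with_momentum(scores_by_class, w=0.5):
    """Smooth every class separately, then pick the class with the highest smoothed score."""
    smoothed = {X: momentum_scores(s, w) for X, s in scores_by_class.items()}
    n = len(next(iter(smoothed.values())))
    return "".join(max(smoothed, key=lambda X: smoothed[X][i]) for i in range(n))


toy_scores = {"H": [0.6, 0.6, 0.3, 0.6], "E": [0.4, 0.4, 0.4, 0.4]}
print(classify_with_momentum(toy_scores, w=0.0))  # HHEH (no momentum)
print(classify_with_momentum(toy_scores, w=0.5))  # HHHH
```

In the toy example the isolated E at the third position disappears once the momentum term is used, which is the stabilizing effect described above.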

One can say that the use of a momentum term does model some biological 
knowledge. It is known that helix, coil, or sheet subsequences usually consist 
of a number of amino acids, not just a single one. The momentum method 
eliminates a number of subsequences of length one from the prediction. This 
often replaces the old prediction by the correct one, which results in the 
better performance. 

Combination of methods: A number of combinations of the above
methods were tested. Combining the definition given in Equation (5) for
$p(a)$, the momentum, and the modified definition of $p(a \mid b)$ proved to be the
most successful one. It increased the prediction performance considerably,
by 6.3 percentage points, to 57.3%.

Our implementations of the GOR algorithm versions I, III and IV, all
without the incorporation of background knowledge and with window size 17,
gave classification rates of 60.7%, 59.6%, and 63.4%. It is not surprising that
the GOR algorithms outperform the Markov approach, as they use statistics
of all pairs in the window. It is, however, surprising how close one can come
to versions I and III of GOR.

4 Ongoing research 

We are currently considering “peaks” of the probabilities. The idea of using
the concept of a peak is motivated by the shapes of the graphs of the three
probability functions $p_E(i)$, $p_H(i)$, and $p_C(i)$. Often the function $p_X$ has a
peak at the first residue of an X-subsequence. See Figure 2 for an example.
The peaks are more prominent when using the original definition (3) of the
term $p(a)$ than when using that given in (5). A peak could be used as an indicator
of the start of a new subsequence. Then the corresponding classification is maintained
until a peak of another probability function is found.




[Figure 2; legend: Coil group, Sheet group, Helix group]



Fig. 2. The picture shows the probability functions for the three classes. Two peaks 
at the left start points of subsequences are marked by ovals. Below the graphs is 
the protein sequence with the correct classifications shown by colors. The colors in 
the line below show the predictions of the Markov model. 



The problem here is to find an appropriate combinatorial definition of
the term “peak”. The absolute values of the functions $p_X$ cannot be used
due to their strong variation. Also, a peak of one function, say $p_E$, does not
necessarily exceed the values of the two other functions. On the other hand,
a peak value of $p_E$ should not be ridiculously small relative to the two other
functions.

First tests with a simple definition of a peak show that using this concept
as a start indicator only gives an improvement of 4 percentage points over
the naive classification, leading to 47%. The plan is to incorporate peak
indicators into the Markov method (or other prediction methods). One way of
doing this is to compare the peak locations with a prediction given by some
other method. Then the alignment of a peak with the start of a predicted 
subsequence would raise our confidence in the prediction. If a predicted sub- 
sequence does not coincide with a peak, then the prediction at this location 
should be checked. 

5 Summary 

We have significantly improved a simple statistical prediction method by a 
thorough analysis of the influence of its different components. Now, the next 
step is to incorporate biological background knowledge into the classification 
process and to combine the Markov predictor with other classifiers. The in- 
vestigations also exposed the “peak” concept as a promising alternative for 
using the statistical information.

Abstract. The Milestones Project is a comprehensive attempt to collect,
document, illustrate, and interpret the historical developments leading to modern data
visualization and visual thinking. This paper provides an overview and brief tour 
of the milestones content, with a few illustrations of significant contributions to the 
history of data visualization. This forms one basis for exploring interesting ques- 
tions and problems in the use of statistical and graphical methods to explore this 
history, a topic that can be called “statistical historiography.” 



1 Introduction 

The only new thing in the world is the history you don’t know . — Harry S 
Truman 

The graphic portrayal of quantitative information has deep roots. These 
roots reach into the histories of the earliest map-making and visual depic- 
tion, and later into thematic cartography, statistics and statistical graphics, 
medicine, and other fields, which are intertwined with each other. They also 
connect with the rise of statistical thinking and widespread data collection 
for planning and commerce up through the 19th century. Along the way, a 
variety of advancements contributed to the widespread use of data visualiza- 
tion today. These include technologies for drawing and reproducing images, 
advances in mathematics and statistics, and new developments in data col- 
lection, empirical observation and recording. 

From above ground, we can see the current fruit; we must look below to 
understand their germination. Yet the great variety of roots and nutrients 
across these domains, that gave rise to the many branches we see today, are 
often not well known, and have never been assembled in a single garden, to 
be studied or admired. 

The Milestones Project is designed to provide a broadly comprehensive 
and representative catalog of important developments in all fields related to 
the history of data visualization. Toward this end, a large collection of images, 
bibliographical references, cross-references and web links to commentaries on 
these innovations has been assembled.

This is a useful contribution in its own right, but is a step towards larger
goals as well. First, we see this not as a static collection, but rather a dynamic 
database that will grow over time as additional sources and historical contri- 
butions are uncovered or suggested to us. Second, we envisage this project as 
providing a tool to enable researchers to work with or study this history, find- 
ing themes, antecedents, influences, patterns, trends, and so forth. Finally, 
as implied by our title, work on this project has suggested several interesting 
questions subsumed under the self-referential term “statistical historiogra- 
phy.” 

1.1 The Milestones Project 

The past only exists insofar as it is present in the records of today. And what
those records are is determined by what questions we ask. (Wheeler (1982),
p. 24)

There are many historical accounts of developments within the fields of 
probability (Hald (1990)), statistics (Pearson (1978), Porter (1986), Stigler 
(1986)), astronomy (Riddell (1980)), cartography (Wallis and Robinson 
(1987)), which relate to, inter alia, some of the important developments 
contributing to modern data visualization. There are other, more special- 
ized accounts, which focus on the early history of graphic recording (Hoff 
and Geddes (1959), Hoff and Geddes (1962)), statistical graphs (Funkhouser 
(1936), Funkhouser (1937), Royston (1970), Tilling (1975)), fitting equations 
to empirical data (Farebrother (1999)), cartography (Friis (1974), Kruskal 
(1977)) and thematic mapping (Palsky (1996), Robinson (1982)), and so
forth. Robinson (1982, Ch. 2) presents an excellent overview of some of
the important scientific, intellectual, and technical developments of the 15th
to 18th centuries leading to thematic cartography and statistical thinking.

But there are no accounts that span the entire development of visual 
thinking and the visual representation of data, and which collate the
contributions of disparate disciplines. Inasmuch as their histories are intertwined,
so too should be any telling of the development of data visualization. Another 
reason for interweaving these accounts is that practitioners in these fields to- 
day tend to be highly specialized, often unaware of related developments 
in areas outside their domain, much less their history. Extending Wheeler
(1982), the records of history also exist insofar as they are collected,
illustrated, and made coherent.

The initial step in portraying the history of data visualization was a sim- 
ple chronological listing of milestone items with capsule descriptions, bibli- 
ographic references, markers for date, person, place, and links to portraits, 
images, related sources or more detailed commentaries. Its current public 
and visible form is that of hyper-linked, interactive documents available 
on the web and in PDF form (http://www.math.yorku.ca/SCS/Gallery/milestone/).
We started with the developments listed by Beniger and Robyn (1978)
and incorporated additional listings from Hankins (1999), Tufte
(1983), Tufte (1990), Tufte (1997), Heiser (2000), and others. With
assistance from Les Chevaliers, many other contributions, original sources, and
images have been added. As explained below, our current goal is to turn this 
into a true multi-media database, which can be searched in flexible ways and 
can be treated as data for analysis. 

2 Milestones tour 

In organizing this material, it proved useful to divide history into epochs, each 
of which turned out to be describable by coherent themes and labels. In the 
larger picture — recounting the history of data visualization — each milestone 
item has a story to be told: What motivated this development? What was 
the communication goal? How does it relate to other developments? What 
were the pre-cursors? What makes it a milestone? To illustrate, we present 
just a few exemplars from a few of these periods. For brevity, we exclude the 
earliest period (pre-17th century) and the most recent period (1975-present)
in this description. 

2.1 1600-1699: Measurement and theory 

Among the most important problems of the 17th century were those con- 
cerned with physical measurement — of time, distance, and space - for as- 
tronomy, surveying, map making, navigation and territorial expansion. This 
century also saw great new growth in theory and the dawn of practice — the 
rise of analytic geometry, theories of errors of measurement and estimation, 
the birth of probability theory, and the beginnings of demographic statistics 
and “political arithmetic.” 

As an example, Figure 1 shows a 1644 graphic by Michael Florent van 
Langren, a Flemish astronomer to the court of Spain, believed to be the first 
visual representation of statistical data (Tufte (1997, p. 15)). At that time, 
lack of a reliable means to determine longitude at sea hindered navigation and 
exploration.¹ This 1D line graph shows all 12 known estimates of the
difference in longitude between Toledo and Rome, and the name of the astronomer
(Mercator, Tycho Brahe, Ptolemy, etc.) who provided each observation. 

What is notable is that van Langren could have presented this information 
in various tables — ordered by author to show provenance, by date to show 
priority, or by distance. However, only a graph shows the wide variation in 
the estimates; note that the range of values covers nearly half the length of 
the scale. Van Langren took as his overall summary the center of the range, 
where there happened to be a large enough gap for him to inscribe “ROMA.” 
Unfortunately, all of the estimates were biased upwards; the true distance 
(16°30') is shown by the arrow. Van Langren’s graph is also a milestone
as the earliest-known exemplar of the principle of “effect ordering for data
display” (Friendly and Kwan (2002)).

1 For navigation, latitude could be fixed from star inclinations, but longitude
required accurate measurement of time at sea, an unsolved problem until 1765.

Fig. 1. Langren’s 1644 graph of determinations of the distance, in longitude, from
Toledo to Rome. The correct distance is 16°30'. Source: Tufte (1997, p. 15).

2.2 1700-1799: New graphic forms 

The 18th century witnessed, and participated in, the initial germination of 
the seeds of visualization that had been planted earlier. Map-makers began to 
try to show more than just geographical position on a map. As a result, new 
graphic forms (isolines and contours) were invented, and thematic mapping 
of physical quantities took root. Towards the end of this century, we see the 
first attempts at the thematic mapping of geologic, economic, and medical 
data. 

Abstract graphs, and graphs of functions were introduced, along with the 
early beginnings of statistical theory (measurement error) and systematic 
collection of empirical data. As other (economic and political) data began to 
be collected, some novel visual forms were invented to portray them, so the 
data could “speak to the eyes.” 

As well, several technological innovations provided necessary nutrients. 
These facilitated the reproduction of data images (color printing, lithogra- 
phy), while other developments eased the task of creating them. Yet, most 
of these new graphic forms appeared in publications with limited circulation, 
unlikely to attract wide attention. 

William Playfair (1759-1823) is widely considered the inventor of most of 
the graphical forms widely used today — first the line graph and bar chart 
(Playfair (1786)), later the pie chart and circle graph (Playfair (1801)). A 
somewhat later graph (Playfair (1821)), shown in Figure 2, exemplifies the 
best that Playfair had to offer with these graphic forms. Playfair used three 
parallel time series to show the price of wheat, weekly wages, and reigning 
monarch over a ~250 year span from 1565 to 1820, and used this graph to 
argue that workers had become better off in the most recent years. 

2.3 1800-1850: Beginnings of modern graphics 

With the fertilization provided by the previous innovations of design and
technique, the first half of the 19th century witnessed explosive growth in
statistical graphics and thematic mapping, at a rate which would not be
equalled until modern times.

Fig. 2. William Playfair’s 1821 time series graph of prices, wages, and ruling
monarch over a 250 year period. Source: Playfair (1821), image from Tufte (1983,
p. 34)

In statistical graphics, all of the modern forms of data display were in- 
vented: bar and pie charts, histograms, line graphs and time-series plots, 
contour plots, scatterplots, and so forth. In thematic cartography, mapping 
progressed from single maps to comprehensive atlases, depicting data on a 
wide variety of topics (economic, social, moral, medical, physical, etc.), and 
introduced a wide range of novel forms of symbolism. 

To illustrate this period, we choose an 1844 “tableau-graphique” (Fig- 
ure 3) by Charles Joseph Minard, an early progenitor of the modern mosaic 
plot (Friendly (1994)). On the surface, mosaic plots descend from bar charts, 
but Minard introduced two simultaneous innovations: the use of divided and 
proportional-width bars so that area had a concrete visual interpretation. The 
graph shows the transportation of commercial goods along one canal route in 
France by variable-width, divided bars (Minard (1844)). In this display the 
width of each vertical bar shows distance along this route; the divided bar 
segments have height ~ amount of goods of various types (shown by shading),
so the area of each rectangular segment is proportional to cost of transport. 
Minard, a true visual engineer (Friendly (2000)), developed such diagrams to 
argue visually for setting differential price rates for partial vs. complete runs. 
Playfair had tried to make data “speak to the eyes,” but Minard wished to 
make them “calculer par l’oeil” as well.

Fig. 3. Minard’s Tableau Graphique, showing the transportation of commercial
goods along the Canal du Centre (Chalon-Dijon). Intermediate stops are spaced 
by distance, and each bar is divided by type of goods, so the area of each tile 
represents the cost of transport. Arrows show the direction of transport. Source: 
ENPC:5860/C351 (Col. et cliche ENPC; used by permission) 



2.4 1850-1900: The Golden Age of statistical graphics 

By the mid-1800s, all the conditions for the rapid growth of visualization had 
been established. Official state statistical offices were established throughout 
Europe, in recognition of the growing importance of numerical information for 
social planning, industrialization, commerce, and transportation. Statistical 
theory, initiated by Gauss and Laplace, and extended to the social realm by 
Quetelet, provided the means to make sense of large bodies of data. 

What started as the Age of Enthusiasm (Palsky (1996)) for graphics may 
also be called the Golden Age, with unparalleled beauty and many innovations 
in graphics and thematic cartography. 

2.5 1900-1950: The modern dark ages 

If the late 1800s were the “golden age” of statistical graphics and thematic
cartography, the early 1900s could be called the “modern dark ages” of
visualization (Friendly and Denis (2000)).

There were few graphical innovations, and, by the mid-1930s, the
enthusiasm for visualization which characterized the late 1800s had been supplanted
by the rise of quantification and formal, often statistical, models in the social 
sciences. Numbers, parameter estimates, and, especially, standard errors were 
precise. Pictures were — well, just pictures: pretty or evocative, perhaps, but 
incapable of stating a “fact” to three or more decimals. Or so it seemed to 
statisticians. 

But it is equally fair to view this as a time of necessary dormancy, ap- 
plication, and popularization, rather than one of innovation. In this period 
statistical graphics became mainstream. It entered textbooks, the
curriculum, and standard use in government, commerce and science. In particular,
perhaps for the first time, graphical methods proved crucial in a number of
scientific discoveries (e.g., the discovery of atomic number by Henry Moseley, and
lawful clusterings of stars based on brightness and color in the
Hertzsprung-Russell diagrams; see Friendly and Denis (2004) for details).



2.6 1950-1975: Re-birth of data visualization 

Still under the influence of the formal and numerical Zeitgeist from the
mid-1930s on, data visualization began to rise from dormancy in the mid-1960s,
spurred largely by three significant developments: 

(a) In the USA, John W. Tukey began the invention of a wide variety of 
new, simple, and effective graphic displays, under the rubric of “Exploratory 
Data Analysis.” (b) In France, Jacques Bertin published the monumental
Sémiologie Graphique (Bertin (1967), Bertin (1983)). To some, this appeared
to do for graphics what Mendeleev had done for the organization of the
chemical elements, that is, to organize the visual and perceptual elements of
graphics according to the features and relations in data. (c) Finally, computer
processing of data had begun, and offered the possibility to construct old and 
new graphic forms by computer programs. True high-resolution graphics were 
developed, but would take a while to enter common use. 

By the end of this period significant intersections and collaborations would 
begin: (a) computer science research (software tools, C language, UNIX, etc.) 
at Bell Laboratories (Becker (1994)) and elsewhere would combine forces with 
(b) developments in data analysis (EDA, psychometrics, etc.) and (c) display 
and input technology (pen plotters, graphic terminals, digitizer tablets, the 
mouse, etc.). These developments would provide new paradigms, languages 
and software packages for expressing statistical ideas and implementing data 
graphics. In turn, they would lead to an explosive growth in new visualization 
methods and techniques. 

Other themes began to emerge, mostly as initial suggestions: (a) various 
visual representations of multivariate data (Andrews’ plots, Chernoff faces, 
clustering and tree representations); (b) animations of a statistical process; 
and (c) perceptually-based theory (or just informed ideas) related to how
graphic attributes and relations might be rendered to better convey the data
visually.



3 Problems and methods in statistical historiography 

As we worked on assembling the Milestones collection, it became clear that 
there were several interesting questions and problems related to conducting 
historical research along these lines. 

3.1 What counts as a Milestone? 

In order to catalog the contributions to be considered as “milestones” in the 
history of data visualization, it is necessary to have some criteria for inclusion: 
for form, content, and substantive domain, as well as for “what counts” as a 
milestone in this context. We deal only with the last aspect here. 

We have adopted the following scheme. First, we decided to consider sev- 
eral types of contributions as candidates: true innovations, important pre- 
cursors and developments or extensions. Second, we have classified these con- 
tributions according to several themes, categories and rubrics for inclusion. 
Attributions without reference here are listed in the Milestones Project web 
documents. 

• Contributions to the development and use of graphic forms. In
statistical graphics, inventions of the bar chart, pie chart, line plot (all
attributed to Playfair), the scatterplot (attributed to J.F.W. Herschel; 
see Friendly and Denis (2004)), 3D plots (Luigi Perozzo), boxplot (J. 
W. Tukey), and mosaic plot (Hartigan & Kleiner) provided new ways of 
representing statistical data. In thematic cartography, isolines (Edmund 
Halley), choropleths (Charles Dupin) and flow maps (Henry Harness; C. 
J. Minard) considerably extended the use of a map-based display to show 
more than simple geographical positions and features. 

• Graphic content: data collection and recording. Visual displays 
of information cannot be done without empirical data, so we must also 
include contributions to measurement (geodesy), recording devices, col- 
lection and dissemination of statistical data (e.g., vital statistics, census, 
social, economic data). 

• Technology and enablement. It is evident that many developments 
had technological prerequisites, and conversely that new technology al- 
lowed new advances that could not have been achieved before. These 
include advances in (a) reproduction of printed materials (printing press, 
lithography), (b) imaging (photography, motion pictures), and (c) ren- 
dering (computing, video display). 

• Theory and practice. Under this heading we include theoretical
advances in the treatment and analysis of data, such as (a) probability theory and
notions of errors of measurement, (b) data summarization (estimation
and modelling), (c) data exposure (e.g., EDA), as well as (d) awareness
and use of these ideas and methods.

• Theory and data on perception of visual displays. Graphic displays 
are designed to convey information to the human viewer, but how people 
use and understand this form of communication was not systematically 
studied until recent times. As well, proposals for graphical standards, 
and theoretical accounts of graphic elements and graphic forms provided 
a basis for thinking of and designing visual displays. 

• Implementation and dissemination. New techniques become avail- 
able when they are introduced, but additional steps are needed to make 
them widely accessible and useable. We are thinking here mainly of im- 
plementations of graphical methods in software, but other contributions 
fall under this heading as well. 

3.2 Who gets credit? 

All of the Milestones items are attributed to specific individuals where we 
have reason to believe that names can be reasonably attached. Yet, Stigler’s 
Law of Eponymy (Stigler (1980)) reminds us that standard attributions are
often not those of priority. The Law in fact makes a stronger claim: “No 
scientific discovery is named after its original discoverer.” As prima facie 
evidence, Stigler attributes the origin of this law to Merton (1973). 

As illustrations, Stigler (1980) states that Laplace first discovered the 
Fourier transform, Poisson first discovered the Cauchy distribution, and both 
de Moivre and Laplace have prior claims to the Gaussian distribution. He 
concludes that eponyms are conveyed by the community of scholars, not
by historians. 

Thus, although all of the events listed are correctly attributed to their de- 
velopers, it cannot be claimed with certainty that we are always identifying 
the first instance, nor that we give credit to all who have, perhaps indepen- 
dently, given rise to a new idea or method. Similarly, in recent times there 
may be some difficulty distinguishing credit among developers of (a) an un- 
derlying method or initial demonstration, (b) a corresponding algorithm, or 
(c) an available software implementation. 



3.3 Dating milestones 

In a similar way, there is some unavoidable uncertainty in the dates attached 
to milestone items, in a degree which generally increases as we go back in 
time. For example, in the 18th and 19th centuries, many papers were first
read at scientific meetings, but recorded in print some years later; William
Smith’s geological map of England was first drawn in 1801, but only finished
and published in 1815; some pre-1600 dates are only known approximately.

In textual accounts of history this does not present any problem: one
can simply describe the circumstances and range of events, dated specifically
or approximately, contributing to some development.

It does matter, however, if we wish to treat item dates as data, either for 
retrieval or analysis/display. For retrieval, we clearly want any date within a 
specified range to match; for analysis or display, the end points will sometimes 
be important, but sometimes it will suffice to use a middle value. 
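
The two requirements above (range matching for retrieval, a middle value for display) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DatedItem` class, its field names, and the overlap rule used for matching are hypothetical, not the Milestones Project's actual implementation. Only the 1801 and 1815 dates for Smith's map are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatedItem:
    """A milestone item whose date is known only as an interval (hypothetical)."""
    label: str
    start: int  # earliest plausible year
    end: int    # latest plausible year (equal to start for an exact date)

    def matches(self, lo: int, hi: int) -> bool:
        """Retrieval: true if the item's interval overlaps the query range [lo, hi]."""
        return self.start <= hi and self.end >= lo

    def midpoint(self) -> float:
        """Analysis/display: a single middle value standing in for the interval."""
        return (self.start + self.end) / 2


# Dates from the text: first drawn in 1801, finished and published in 1815.
smith_map = DatedItem("William Smith's geological map of England", 1801, 1815)

print(smith_map.matches(1810, 1820))  # query range overlaps the item's interval
print(smith_map.matches(1816, 1850))  # query range starts after the item ends
print(smith_map.midpoint())
```

Storing both end points keeps the uncertainty explicit, so the choice between end points and a middle value can be deferred to the analysis or display step.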

3.4 What is milestones “data”?

The Milestones Project represents ongoing work. We continually update the 
web and pdf versions as we add items and images, many of which have been 
contributed by Les Chevaliers. To make this work, we rely on software tools 
to generate different versions from a single set of document sources, so that 
all versions can be updated automatically. For this, we chose to use LaTeX
and BibTeX.

More recently, we have developed tools to translate this material to other 
forms (e.g., XML or CSV) in order to be able to work with it as “data.” In 
doing so, it seemed natural to view the information as coming from three 
distinct sources, that we think of as a relational database, linked by unique 
keys in each, as shown in Figure 4. 
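
As a rough sketch of this three-source view, the following Python fragment builds a small in-memory SQLite database. All table names, column names, and key values are assumptions made for illustration; the actual fields of the Milestones sources are not specified here. The single sample row uses the Minard item discussed earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (item_key TEXT PRIMARY KEY, year INTEGER, title TEXT);
    CREATE TABLE refs  (bib_key  TEXT PRIMARY KEY, citation TEXT);
    CREATE TABLE media (media_key TEXT PRIMARY KEY, caption TEXT);
    -- Link tables let one history item point to several references or images.
    CREATE TABLE item_refs  (item_key TEXT REFERENCES items, bib_key TEXT REFERENCES refs);
    CREATE TABLE item_media (item_key TEXT REFERENCES items, media_key TEXT REFERENCES media);
""")

# One hypothetical record per source, joined through their keys.
conn.execute("INSERT INTO items VALUES ('minard1844', 1844, 'Tableau graphique')")
conn.execute("INSERT INTO refs VALUES ('Minard:1844', 'Minard (1844)')")
conn.execute("INSERT INTO media VALUES ('fig3', 'Minard''s Tableau Graphique')")
conn.execute("INSERT INTO item_refs VALUES ('minard1844', 'Minard:1844')")
conn.execute("INSERT INTO item_media VALUES ('minard1844', 'fig3')")

rows = conn.execute("""
    SELECT i.year, r.citation, m.caption
    FROM items AS i
    JOIN item_refs  AS ir ON ir.item_key = i.item_key
    JOIN refs       AS r  ON r.bib_key = ir.bib_key
    JOIN item_media AS im ON im.item_key = i.item_key
    JOIN media      AS m  ON m.media_key = im.media_key
""").fetchall()
print(rows)
```

The separate link tables are one possible design; they allow a history item to cite several references and images without duplicating either.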

3.5 Analyzing milestones “data” 

Once the milestones data has been re-cast as a database, statistical analysis 
becomes possible. The simplest case is to look at trends over time. Figure 5 
shows a density estimate for the distribution of milestones items from 1500 to 
the present, keyed to the labels for the periods in history. The bumps, peaks 
and troughs all seem interpretable: note particularly the steady rise up to about
1880, followed by a decline through the “modern dark ages” to about 1945, then
the steep rise up to the present.
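
A density estimate of this kind can be sketched with a Gaussian kernel in plain Python. The estimator and bandwidth actually used for Figure 5 are not stated, so both are assumptions here, and the `years` list simply collects a few dates mentioned in this paper rather than the Milestones data.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function estimating the density of `data` with Gaussian kernels."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)

    return density


# Illustrative years only (dates cited in this paper), NOT the Milestones items.
years = [1765, 1786, 1801, 1821, 1844, 1967, 1986, 1995]
density = gaussian_kde(years, bandwidth=25.0)  # bandwidth in years, chosen arbitrarily

# Crude text rendering of the estimated curve.
for year in range(1750, 2001, 50):
    bar = "#" * round(density(year) * 2000)
    print(f"{year}  {density(year):.5f}  {bar}")
```

A smaller bandwidth would show more bumps and a larger one a smoother curve, so any interpretation of peaks and troughs depends on that choice.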

If we classify the items by place of development (Europe vs. North Amer- 
ica), other interesting trends appear (Figure 6). The decline in Europe fol- 
lowing the Golden Age was accompanied by an initial rise in North America, 
largely due to popularization (e.g., text books) and significant applications of 
graphical methods, then a steep decline as mathematical statistics held sway. 

3.6 What was he thinking?: Understanding through reproduction 

Historical graphs were created using available data, methods, technology, 
and understanding current at the time. We can often come to a better un- 
derstanding of intellectual, scientific, and graphical questions by attempting 
a re-analysis from a modern perspective. 

Fig. 4. Milestones data as a relational database composed of history-item,
bibliographic, and multimedia databases

Earlier, we showed Playfair’s time-series graph (Figure 2) of wages and
prices, and noted that Playfair wished to show that workers were better off at
the end of the period shown than at any earlier time. Presumably he wished
to draw the reader’s eye to the narrowing of the gap between the bars for
prices and the line graph for wages. Is this what you see?

What this graph shows directly is quite different from Playfair’s intention.
It appears that wages remained relatively stable, while the price of wheat 
varied greatly. The inference that wages increased relative to prices is indirect 
and not visually compelling. 

We cannot resist the temptation to give Playfair a helping hand here — 
by graphing the ratio of prices to wages (the labor cost of wheat), as shown in
Figure 7. But this would not have occurred to Playfair, because the idea of 
relating one time series to another by ratios (index numbers) would not occur 
for another half-century (Jevons). See Friendly and Denis (2004) for further 
discussion of Playfair’s thinking. 
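
The ratio itself is simple arithmetic, as the following Python sketch shows. It divides price by wage, which gives the labor cost of wheat plotted in Figure 7. The numbers and units are invented for illustration and are not Playfair's data.

```python
# Hypothetical values for illustration only; these are NOT Playfair's data.
years = [1600, 1650, 1700, 1750, 1800]
wheat_price = [30.0, 45.0, 35.0, 40.0, 80.0]  # price per quarter of wheat (assumed)
weekly_wage = [5.0, 6.0, 7.0, 10.0, 25.0]     # wage per week, same currency (assumed)

# Labor cost of wheat: weeks of work needed to buy one quarter of wheat.
labor_cost = [p / w for p, w in zip(wheat_price, weekly_wage)]

for year, cost in zip(years, labor_cost):
    print(f"{year}: {cost:.1f} weeks of wages per quarter of wheat")
```

In this made-up series the price rises overall, yet the labor cost falls, which is the kind of indirect relation a plot of the two raw series does not show at a glance.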



3.7 What kinds of tools are needed? 



We have also wondered how other advances in statistics and data
visualization could be imported to a historical realm. Among other topics, there has
recently been a good deal of work in document analysis and classification that
suggests an analog of EDA we might call Exploratory Bibliographic Analysis
(EBA).

Fig. 5. The distribution of milestone items over time, shown by a rug plot and
density estimate. (Plot title: “Milestones: Time course of developments”;
horizontal axis: Year.)

It turns out that there are several instances of software systems that 
provide some basic tools for this purpose. An example is RefViz
(http://www.refviz.com), shown in Figure 8. This software links to common
bibliographic software (EndNote, ProCite, Reference Manager, etc.), codes
references using key terms from the title and abstract and calculates an index
of similarity between pairs of references based on frequencies of co-occurrence.
Associations between documents can be shown in a color-coded matrix view,
as in Figure 8, or a galaxy view (combining cluster analysis and MDS),
and each view offers zoom/unzoom, sorting by several criteria, and querying 
individual documents or collections. 
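
To make the idea of a pairwise similarity index concrete, here is a minimal Python sketch. How RefViz codes references and computes its index is not described here, so the Jaccard coefficient on title words is swapped in as a stand-in. The stopword list and the record keys are likewise assumptions; the titles come from the reference list of this paper.

```python
from itertools import combinations

STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def key_terms(title):
    """Lower-case the title and keep the words that are not stopwords."""
    words = (w.strip(".,:;()").lower() for w in title.split())
    return {w for w in words if w and w not in STOPWORDS}


def jaccard(a, b):
    """Share of key terms two references have in common (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0


# Hypothetical record keys; titles taken from the reference list.
titles = {
    "Beniger1978": "Quantitative graphics in statistics: A brief history",
    "Hankins1999": "Blood, dirt, and nomograms: A particular history of graphs",
    "Tilling1975": "Early experimental graphs",
}
terms = {key: key_terms(title) for key, title in titles.items()}

similarity = {
    (i, j): jaccard(terms[i], terms[j]) for i, j in combinations(sorted(terms), 2)
}
for pair, value in sorted(similarity.items()):
    print(pair, round(value, 3))
```

The resulting values could fill the cells of a matrix view or serve as input to cluster analysis and MDS, in the spirit of the displays described above.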



4 How to visualize a history? 

A timeline is obvious, but has severe limitations. We record a history of over 
8000 years, but only the last 300-400 have substantial contributions. As well, 
a linear representation entails problems of display, resolution and access to 
detailed information, with little possibility to show either content or context. 
We explore a few ways to escape these constraints below.

Fig. 6. The distribution of milestone items over time, comparing trends in Europe
and North America. (Plot title: “Milestones: Places of development”.)

Fig. 7. Redrawn version of Playfair’s time series graph showing the ratio of price
of wheat to wages, together with a loess smoothed curve. (Horizontal axis: Year.)


4.1 Lessons from the past 

In the milestones collection, we have three examples of attempts to display
a history visually. It is of interest that all three used essentially the same
format: a horizontal, linear scale for time, with different content or context
stacked vertically, as separate horizontal bands.

Fig. 8. RefViz similarity matrix view of a bibliographic database. The popup grid
is a zoomed display of the region surrounding a selected cell.

We illustrate with Joseph Priestley’s Chart of Biography (Priestley 
(1765)), showing the lifespans of famous people from 1200 BC to 1750 (Fig- 
ure 9). Priestley divided people into two groups: 30 “men of learning” and 
29 “statesmen,” showing each lifespan as a horizontal line. He invented the 
convention of using dots to indicate uncertainty about exact date of birth or 
death. 



4.2 Lessons from the present 

In modern times, a variety of popular publications, mostly in poster form, 
have attempted to portray graphically various aspects of the history of civi- 
lization, geographic regions, or of culture and science. 

For example, Hammond’s Graphic History of Mankind (Figure 10) shows 
the emergence of new cultures and the rise and fall of various empires, nations 
and ethnic groups from the late Stone Age to the present in a vertical format. 
It uses a varying-resolution time scale, quite coarse in early history, getting 
progressively finer up to recent times. It portrays these using flow lines of 
different colors, whose width indicates the influence of that culture, and with 
shading or stripes to show conquest or outside influence.

Fig. 9. Priestley’s Chart of Biography. Source: Priestley (1765)

Fig. 10. Hammond’s Graphic History of Mankind (first of 5 panels)



4.3 Lessons from the web 

A large component of the milestones collection is the catalog of graphic images 
and portraits associated with the milestones items. At present, they are stored 
as image files of fixed resolution and size, and presented as hyper-links in the 
public versions. How can we do better, to make this material more easily 
accessible?

There are now a number of comprehensive image libraries available on
the web that provide facilities to search for images by various criteria and in
some cases to view these at varying resolutions. Among these, David
Rumsey’s Map Collection (http://www.davidrumsey.com) is notable. It provides
access to a collection of over 8800 historical maps (mostly 18th-19th century,
of North/South America, with some European content) online, extensively
indexed so they may be searched by author, category, country or region,
and a large number of other data fields. The maps are stored using MrSID
technology (http://www.lizardtech.com), which means that they can be
zoomed and panned in real time. Rumsey provides several different browsers,
including a highly interactive Java client.



4.4 Lessons from data visualization

Modern data visualization also provides a number of different ideas and ap- 
proaches to multivariate complexity, time and space we may adapt (in a 
self-referential way) to the history of data visualization itself. 

Interactive viewers provide one simple solution to the trade-off between 
detail and scope of a data view through zoom and unzoom, but in the most 
basic implementation, any given view is a linear scaling of the section of the 
timeline that will fit within the given window. 

We can do better by varying resolution continuously as a non-linear, de- 
creasing function of distance from the viewer’s point of focus. For example, 
Figure 11 shows a fisheye view (Furnas (1986)) of central Washington, D.C., 
using a hyperbolic scale, so that resolution is greatest at the center and 
decreases as 1/distance. The map is dynamic, so that moving the cursor 
changes the focal point of highest resolution. This has the property that it
allows the viewer to see the context surrounding the point of focus, yet
navigate smoothly throughout the entire space. Similar ideas have been applied
to tables in the Table Lens (http://www.tablelens.com) and hierarchies
(Lamping et al. (1995)) such as web sites and file systems, and can easily be
used for a 1D timeline.
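
For a timeline, the varying-resolution idea might be sketched as follows in Python. The specific form is an assumption: local resolution is taken as 1 / (1 + d / scale), which falls off roughly as 1/distance far from the focus, and integrating it gives the screen offset. The function name and the `scale` parameter are hypothetical, and this is not presented as the method of Furnas (1986) or Lamping et al. (1995).

```python
import math


def fisheye_position(year, focus, scale=50.0):
    """Map a year to a signed screen offset from the focus year.

    Assumed form: local resolution is 1 / (1 + d / scale), with d the distance
    in years from the focus; integrating it gives scale * ln(1 + d / scale).
    """
    d = abs(year - focus)
    offset = scale * math.log1p(d / scale)
    return math.copysign(offset, year - focus)


focus = 1850
for year in range(1500, 2001, 100):
    print(year, round(fisheye_position(year, focus), 1))
```

Years near the focus are spread out while distant centuries are compressed, so the whole span stays visible; changing `focus` re-centers the region of highest resolution.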
Fig. 11. Fisheye view of central Washington, D.C., illustrating a hyperbolic view



References 



BECKER, R. A. (1994): A brief history of S. In: P. Dirschedl and R. Ostermann 
(Eds.): Computational Statistics. Physica, Heidelberg, 81-110. 

BENIGER, J. R. and ROBYN, D. L. (1978): Quantitative graphics in statistics: A 
brief history. The American Statistician, 32, 1-11. 

BERTIN, J. (1967): Sémiologie Graphique: Les diagrammes, les réseaux, les cartes.
Gauthier-Villars, Paris.

BERTIN, J. (1983): Semiology of Graphics. University of Wisconsin Press, Madison, 
WI. (trans. W. Berg). 

FAREBROTHER, R. W. (1999): Fitting Linear Relationships: A History of the
Calculus of Observations 1750-1900. Springer, New York.

FRIENDLY, M. (1994): Mosaic displays for multi-way contingency tables. Journal 
of the American Statistical Association, 89, 190-200. 

FRIENDLY, M. (2000): Re-Visions of Minard. Statistical Computing & Statistical 
Graphics Newsletter, 11/1, 1, 13-19. 

FRIENDLY, M. and DENIS, D. (2000): The roots and branches of statistical
graphics. Journal de la Société Française de Statistique, 141/4, 51-60. (published in
2001).

FRIENDLY, M. and DENIS, D. (2004): The early origins and development of the 
scatterplot. Journal of the History of the Behavioral Sciences. (In press, ac- 
cepted 7/09/04). 






FRIENDLY, M. and KWAN, E. (2003): Effect ordering for data displays.
Computational Statistics and Data Analysis, 43/4, 509-539.

FRIIS, H. R. (1974): Statistical cartography in the United States prior to 1870 
and the role of Joseph C. G. Kennedy and the U.S. Census Office. American 
Cartographer, 1, 131-157. 

FUNKHOUSER, H. G. (1936): A note on a tenth century graph. Osiris, 1, 260-262. 

FUNKHOUSER, H. G. (1937): Historical development of the graphical represen- 
tation of statistical data. Osiris, 3/1, 269-405. Reprinted St. Catherine Press, 
Brugge, 1937. 

FURNAS, G. W. (1986): Generalized fisheye views. In: Proceedings of the ACM 
CHI ’86 Conference on Human Factors in Computing Systems. ACM, Boston, 
MA, 16-23. 

HALD, A. (1990): A History of Probability and Statistics and their Application 
before 1750. John Wiley & Sons, New York. 

HANKINS, T. L. (1999): Blood, dirt, and nomograms: A particular history of 
graphs. Isis, 90, 50-80. 

HEISER, W. J. (2000): Early roots of statistical modelling. In: J. Blasius, J. Hox,
E. de Leeuw, and P. Schmidt (Eds.): Social Science Methodology in the New
Millennium: Proceedings of the Fifth International Conference on Logic and
Methodology. TT-Publikaties, Amsterdam.

HOFF, H. E. and GEDDES, L. A. (1959): Graphic recording before Carl Ludwig: 
An historical summary. Archives Internationales d’Histoire des Sciences, 12, 
3-25. 

HOFF, H. E. and GEDDES, L. A. (1962): The beginnings of graphic recording. 
Isis, 53, 287-324. Pt. 3. 

KRUSKAL, W. (1977): Visions of maps and graphs. In: Proceedings of the
International Symposium on Computer-Assisted Cartography, Auto-Carto II. 27-36.

LAMPING, J., RAO, R. and PIROLLI, P. (1995): A focus+context technique based 
on hyperbolic geometry for visualizing large hierarchies. In: Proceedings of the 
SIGCHI conference on Human factors in computing systems. ACM, 401-408. 

MERTON, R. K. (1973): Sociology of Science: Theoretical and Empirical Investi- 
gations. University of Chicago Press, Chicago, IL. 

MINARD, C. J. (1844): Tableaux figuratifs de la circulation de quelques chemins
de fer. lith. (n.s.). ENPC: 5860/C351, 5299/C307.

PALSKY, G. (1996): Des Chiffres et des Cartes: Naissance et développement de la
Cartographie Quantitative Française au XIXe siècle. CTHS, Paris.

PEARSON, E. S., ed. (1978): The History of Statistics in the 17th and 18th
Centuries Against the Changing Background of Intellectual, Scientific and
Religious Thought. Griffin & Co. Ltd, London. Lectures by Karl Pearson given at
University College London during the academic sessions 1921-1933. 

PLAYFAIR, W. (1786): Commercial and Political Atlas: Representing, by Copper- 
Plate Charts, the Progress of the Commerce, Revenues, Expenditure, and Debts 
of England, during the Whole of the Eighteenth Century. Corry, London. 3rd 
edition, Stockdale, London, 1801; French edition, Tableaux d’arithmétique
linéaire, du commerce, des finances, et de la dette nationale de l’Angleterre
(Chez Barrois l’Aîné, Paris, 1789).

PLAYFAIR, W. (1801): Statistical Breviary; Shewing, on a Principle Entirely New, 
the Resources of Every State and Kingdom in Europe. Wallis, London. 






PLAYFAIR, W. (1821): Letter on our agricultural distresses, their causes and reme- 
dies; accompanied with tables and copperplate charts shewing and comparing 
the prices of wheat, bread and labour, from 1565 to 1821. 

PORTER, T. M. (1986): The Rise of Statistical Thinking 1820-1900. Princeton 
University Press, Princeton, NJ. 

PRIESTLEY, J. (1765): A chart of biography. London. 

RIDDELL, R. C. (1980): Parameter disposition in pre-Newtonian planetary
theories. Archives Hist. Exact Sci., 23, 87-157.

ROBINSON, A. H. (1982): Early Thematic Mapping in the History of Cartography. 
University of Chicago Press, Chicago. 

ROYSTON, E. (1970): Studies in the history of probability and statistics, III. A
note on the history of the graphical presentation of data. Biometrika, 43,
Pts. 3 and 4 (December 1956), 241-247; reprinted in: E. S. Pearson and M.
G. Kendall (Eds.): Studies in the History of Statistics and Probability Theory.
Griffin, London.

STIGLER, S. M. (1980): Stigler’s law of eponymy. Transactions of the New York
Academy of Sciences, 39, 147-157.

STIGLER, S. M. (1986): The History of Statistics: The Measurement of Uncertainty 
before 1900. Harvard University Press, Cambridge, MA. 

TILLING, L. (1975): Early experimental graphs. British Journal for the History of 
Science, 8, 193-213. 

TUFTE, E. R. (1983): The Visual Display of Quantitative Information. Graphics 
Press, Cheshire, CT. 

TUFTE, E. R. (1990): Envisioning Information. Graphics Press, Cheshire, CT. 

TUFTE, E. R. (1997): Visual Explanations. Graphics Press, Cheshire, CT. 

WALLIS, H. M. and ROBINSON, A. H. (1987): Cartographical Innovations: An In- 
ternational Handbook of Mapping Terms to 1900. Map Collector Publications, 
Tring, Herts. 

WHEELER, J. A. (1982): Bohr, Einstein, and the strange lesson of the quantum. 
In: R. Q. Elvee (Ed.): Mind in Nature. Harper and Row, San Francisco. 



Quantitative Text Typology: 
The Impact of Word Length 



Peter Grzybek¹, Ernst Stadlober², Emmerich Kelih¹, and Gordana Antic²

¹ Department for Slavic Studies, University Graz, A-8010 Graz, Austria

² Department for Statistics, Graz University of Technology, A-8010 Graz, Austria



Abstract. The present study aims at the quantitative classification of texts and 
text types. By way of a case study, 398 Slovenian texts from different genres and 
authors are analyzed as to their word length. It is shown that word length is an 
important factor in the synergetic self-regulation of texts and text types, and that 
word length may significantly contribute to a new typology of discourse types. 2 



1 Introduction: Structuring the universe of texts 

Theoretically speaking, we assume that there is a universe of texts represent- 
ing an open (or closed) system, i.e. an infinite (or finite) number of textual 
objects. The structure of this universe can be described by two processes: 
identification of its objects, based on a definition of ‘text’, and classification 
of these objects, resulting in the identification and description of hierarchi- 
cally ordered sub-systems. To pursue the astronomic metaphor, the textual 
universe will be divided into particular galaxies, serving as attractors of indi- 
vidual objects. Finally, within such galaxies, particular sub-systems of lower 
levels will be identified, comparable to, e.g., stellar or solar systems. The 
two processes of identification and classification cannot be realized without 
recourse to theoretical assumptions as to the obligatory and/or facultative 
characteristics of the objects under study: neither quantitative nor qualita- 
tive characteristics are immanent to the objects; rather, they are the result 
of analytical cognitive processes. 

1.1 Classification and quantification 

To one degree or another, any kind of classification involves quantification: 
Even in seemingly qualitative approaches, quantitative arguments come into 
play, albeit possibly only claiming - implicitly or explicitly - that some ob- 
jects are ‘more’ or ‘less’ similar or close to each other, or to some alleged 
norm or prototype. The degree of quantification is governed by the traits 
incorporated into the meta-language. Hence it is of relevance on which ana- 
lytical level the process of classification is started. Note that each level has 
its own problems as to the definition of sub-systems and their boundaries. 

2 This study is related to research project #15485 (Word Length Frequencies in
Slavic Texts), supported by the Austrian Research Fund (FWF).

In any case, a classification of the textual universe cannot be achieved
without empirical research. Here, it is important to note that the under- 
standing of empirical work is quite different in different disciplines, be they 
concerned with linguistic objects or not. Also, the proportion of theory and 
practice, the weighting of qualitative and quantitative arguments, may sig- 
nificantly differ. Disciplines traditionally concentrating on language tend to 
favor theoretical and qualitative approaches; aside from these approaches, 
corpus linguistics as a specific linguistic sub-discipline has a predominant 
empirical component. Defining itself as “data-oriented”, the basic assump- 
tion of corpus linguistics is that a maximization of the data basis will result 
in an increasingly appropriate (“representative”) language description. Ulti- 
mately, none of these disciplines - be they of predominantly theoretical or 
empirical orientation - can work without quantitative methods. 

Here, quantitative linguistics comes into play as an important discipline 
in its own right: as opposed to the approaches described above, quantita- 
tive linguistics strives for the detection of regularities and connections in the 
language system, aiming at an empirically based theory of language. The 
transformation of observed linguistic data into quantities (i.e., variables and
constants) is understood as a standardized approach to observation. Specific
hypotheses are statistically tested and, ideally speaking, the final interpreta- 
tion of the results obtained is integrated into a theoretical framework. 



1.2 Quantitative text analysis: From a definition of the basics 
towards data homogeneity 

The present attempt follows these lines, striving for a quantitative text ty- 
pology. As compared to corpus linguistics, this approach - which may be 
termed quantitative text analysis - is characterized by two major lines of 
thinking: apart from the predominantly theoretical orientation, the assump- 
tion of quantitative text analysis is that ‘text’ is the relevant analytical unit at 
the basis of the present analysis. Since corpus linguistics aims at the construc- 
tion, or re-construction, of particular norms, of “representative” standards, of 
(a given) language, corpus-oriented analyses are usually based on a mixture of 
heterogeneous texts, of a “quasi text”, in a way (Orlov (1982)). On contrast, 
quantitative text analysis focuses on texts as homogeneous entities. The basic 
assumption is that a (complete) text is a self-regulating system, ruled by par- 
ticular regularities. These regularities need not necessarily be present in text 
segments, and they are likely to intermingle in any kind of text combination. 
Quite logically, the question remains, what a ‘text’ is: is it a complete novel, 
composed of books?, or the complete book of a novel, consisting of several 
chapters?, or each individual chapter of a given book?, or perhaps even a 
paragraph, or a dialogical or narrative sequence within it? Ultimately, there 
is no clear definition in text scholarship, and questions whether we need a 
“new” definition of text regularly recur in relevant discussions. Of course,
this theoretical question goes beyond the scope of this paper. From a statistical
point of view, we are faced with two major problems: the problem of data
homogeneity, and the problem of the basic analytical units. Thus, particular 
decisions have to be made as to the boundary conditions of our study: 

> We consider a ‘text’ to be the result of a homogeneous process of text 
generation. Therefore, we concentrate on letters, or newspaper comments, 
or on chapters of novels, as individual texts. Assuming that such a ‘text’ is 
governed by synergetic processes, these processes can and must be quan- 
titatively described. The descriptive models obtained for each ‘text’ can 
be compared to each other, possibly resulting in one or more general 
model(s); thus, a quantitative typology of texts can be obtained. 

> But even with a particular definition of ‘text’, it has to be decided which 
of their traits are to be submitted to quantitative analyses. Here, we 
concentrate on word length, as one particular linguistic trait of a text.



1.3 Word length in a synergetic context 

Word length is, of course, only one linguistic trait of texts, among others, and 
one would not expect a coherent text typology, based on word length only. 
However, the criterion of word length is not an arbitrarily chosen factor (cf. 
Grzybek (2004)). First, experience has shown that genre is a crucial factor 
influencing word length (Grzybek and Kelih (2004); Kelih et al., this volume); 
this observation may as well be turned into the question to what degree word
length studies may contribute to a quantitative typology of texts. And second, 
word length is an important factor in a synergetic approach to language 
and text. We cannot discuss the synergetics of language in detail, here (cf. 
Kohler (1986)); yet, it should be made clear that word length is no isolated 
linguistic phenomenon: given one accepts the distinction of linguistic levels, as 
(1) phoneme/grapheme, (2) syllable/morpheme, (3) word/lexeme, (4) clause,
and (5) sentence, at least the first three levels are concerned with recurrent 
units. Consequently, on each of these levels, the re-occurrence of units results 
in particular frequencies, which may be modelled with recourse to specific 
frequency distribution models. Both the units and their frequencies are closely 
related to each other. The units of all five levels are characterized by length, 
again mutually influencing each other, resulting in specific frequency length 
distributions. Table 1 demonstrates the interrelations. 

Finally, in addition to the decisions made, it remains to be decided which 
shall be the analytical units, that is not only what a ‘word’ is (a graphemic, 
phonetic, phonological, intonational, etc. unit), but also in which units word 
length is supposed to be measured (number of letters, of graphemes, of 
phonemes, syllables, morphemes, etc.). 

> In the present analysis, we concentrate on word as an orthographic- 
phonemic category (cf. Antic et al. (2004)), measuring word length as 
the number of syllables per word.

Table 1. Word length in a synergetic circuit

[Diagram: the levels SENTENCE, CLAUSE, WORD / LEXEME, SYLLABLE / MORPHEME, and
PHONEME / GRAPHEME, with interconnected nodes labelled Length and Frequency.]



1.4 Qualitative and quantitative classifications:
A priori and a posteriori

Given these definitions, we can now pursue our basic question as to a quanti- 
tative text typology. As mentioned above, the quantitative aspect of classifica- 
tion is often neglected or even ignored in qualitative approaches. As opposed 
to this, qualitative categories play an overtly accepted role in quantitative 
approaches, though the direction of analysis may be different: 

1. One may favor a “tabula rasa” principle, not attributing any qualitative
characteristics in advance; the universe of texts is structured according to 
word length only, e.g. by clustering methods, by analyzing the parameters 
of frequency distributions, etc.; 

2. One may prefer an a priori <-> a posteriori principle: in this case, a partic- 
ular qualitative characteristic is attributed to each text, and then, e.g. by 
discriminant analysis, one tests whether these categorizations correspond 
to the quantitative results obtained. 

Applying qualitative categories, the problem of data heterogeneity once 
again comes into play, now depending on the meta-language chosen. In order 
to understand the problem, let us suppose, we want to attribute a category 
such as ‘text type’ to each text. In a qualitative approach, the text universe 
is structured with regard to external (pragmatic) factors - “with reference to
the world”. The categories usually are based either on general communicative
functions of language (resulting in particular functional styles) or on specific 
situational functions (resulting in specific text sorts). 

(a) The concept of functional style, successfully applied in previous quanti- 
tative research (cf. Mistrik (1966)), has been mainly developed in Russian 
and Czechoslovak stylistics, understanding style as serving particular socio- 
communicative functions. A functional style thus relates to particular dis- 
course spheres, such as everyday, official-administrative, scientific, jour- 
nalistic, or artistic communication. Such a coarse categorization with 
only about half a dozen categories necessarily results in an extreme
heterogeneity of the texts included in the individual categories.

(b) Contemporary text sort research (cf. Adamczik (1995), 255ff.) distinguishes
ca. 4,000 categories. In this case, the categories are less broad
and general, the material included tends to be more homogeneous, but 
the number of categories can hardly be handled in empirical research. 

In order to profit from the advantages of both approaches, it seems rea- 
sonable to combine these two principles (cf. Grzybek and Kelih (2004)): each 
text sort thus tentatively is attributed to a functional style (cf. Figure 1), the 
attribution being understood as a more or less subjective a priori classification.
Thus, in the subsequent quantitative analysis, both bottom-up (text →
text sort → functional style) and top-down analyses are possible in a vertical
perspective, as well as first order and second order cross-comparisons, in a 
horizontal perspective (i.e., between different functional styles or text sorts). 

Fig. 1. Functional styles and text sorts

Our basic assumption is that the highest level - the entities of which are
comparable to ‘text galaxies’ (see above) - should not primarily be considered
to be defined by socio-communicative functions, but regarded as linguistic 
phenomena: It seems reasonable to assume that different text sorts (analo- 
gous to our “stellar systems”), which serve particular functions as well, should 
be characterized by similar linguistic or stylistic traits. As opposed to merely 
qualitative text typologies, the attribution of text sorts to functional styles 
is to be understood as an a priori hypothesis, to be submitted to empirical 
tests. As a result, it is likely that either the a priori attributions have to be 
modified, or that other categories have to be defined at the top level, e.g. 
specific discourse types, instead of functional styles.



2 A case study: Classifying 398 Slovenian texts 

The present case study is an attempt to arrive at a classification of 398 
Slovenian texts, belonging to various sorts, largely representing the spectrum 
of functional styles; the sample is characterized in Table 2. The emphasis 






Table 2. 398 Slovenian texts

FUNCTIONAL STYLE   AUTHOR(S)                TEXT TYPE(S)                       no.
Everyday           Cankar, Jurcic           Private Letters                     61
Public             various                  Open Letters                        29
Journalistic       various                  Readers’ Letters, Comments          65
Artistic
  Prose            Cankar                   Individual Chapters from            68
                                            Short Novels (povest)
                   Svigelj-Merat / Kolsek   Letters from an Epistolary Novel    93
  Poetry           Gregorcic                Versified Poems                     40
  Drama            Jancar                   Individual Acts from Dramas         42


on different types of letters is motivated by the fact that ‘letter’ as a genre 
often is regarded to be prototypical of (a given) language in general, since 
a ‘letter’ is assumed to be located between oral and written communication, 
and considered as the result of a unified, homogeneous process of text gen- 
eration. This assumption is problematic, however, if one takes into account 
the fact that contemporary text sort research (cf. Adamczik (1995), 255ff.) 
distinguishes several dozen different letter types. Consequently, it would
be of utmost importance (i) to compare how the genre of letters as a whole 
relates to other genres, and (ii) to see how different letter types relate to 
each other - in fact, any difference would weaken the argument of the letter’s 
prototypicality. 

In our analyses, each text is analyzed with regard to word length, the mean
($m_1$) being only one variable characterizing a given frequency distribution.
In fact, there is a pool of ca. 30 variables at our disposal, including the four
central moments, variance and standard deviation, coefficient of variation,
dispersion index, entropy, repeat rate, etc. These variables are derived from
the word length frequencies of a given text; by way of example, Figure 2 shows
the relative frequencies of x-syllable words for two arbitrarily chosen texts. In
this case, there are significant differences between almost all length classes.
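
As a minimal sketch of how such variables can be derived from the word length frequencies, the following Python function computes a handful of them from a list of syllable counts. The function name, the input format, and the use of the population (rather than sample) form of the variance are assumptions made for illustration only; they are not taken from the study.

```python
from collections import Counter
from math import log2, sqrt


def word_length_features(syllables_per_word):
    """Summarize a text's word length distribution (length = syllables per word)."""
    n = len(syllables_per_word)
    counts = Counter(syllables_per_word)
    # Relative frequency p_x of x-syllable words.
    p = {x: c / n for x, c in sorted(counts.items())}
    m1 = sum(x * px for x, px in p.items())              # mean word length
    m2 = sum((x - m1) ** 2 * px for x, px in p.items())  # variance (population form)
    s = sqrt(m2)                                         # standard deviation
    return {
        "m1": m1,
        "m2": m2,
        "v": s / m1,                                     # coefficient of variation
        "p1": p.get(1, 0.0),                             # share of one-syllable words
        "p2": p.get(2, 0.0),                             # share of two-syllable words
        "entropy": -sum(px * log2(px) for px in p.values()),
        "repeat_rate": sum(px ** 2 for px in p.values()),
    }


# Hypothetical ten-word text, given as syllables per word.
features = word_length_features([1, 2, 2, 3, 1, 1, 2, 4, 1, 2])
```

Each text thus becomes one row of numeric variables, which is the form needed for the post hoc and discriminant analyses.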




Fig. 2. Word length frequencies (in %) of two different texts (Left: Comment
(#324). Right: Private letter (#1))

2.1 Post hoc analysis of mean word length

By way of a first approximation, it seems reasonable to conduct a post hoc
analysis of the mean values. As a result of this procedure, groups without
significant differences form homogeneous subgroups, whereas differing groups
are placed in different subgroups. As can be seen from Table 3, which is based on
mean word length ($m_1$) only, homogeneous subgroups do in fact exist; even
more importantly, however, all four letter types fall into different categories.
This observation casts doubt on the assumption that ‘letter’ as a category
can serve as a prototype of language without further distinction.



Table 3. Post hoc analyses ($m_1$)

                          Subgroup for α = .05
Text sort           n     1 2 3     4         5
Poems               40    1.7127
Short stories       68    1.8258
Private letters     61    1.8798
Drama               42    1.8973
Epistolary novel    93    2.0026
Readers’ letters    30              2.2622
Comments            35              2.2883
Open letters        29                        2.4268



2.2 Discriminant analyses: The whole corpus 

In linear discriminant analyses, specific variables are submitted to linear 
transformations in order to arrive at an optimal discrimination of the in- 
dividual cases. At first glance, many variables of our pool may be important 
for discrimination, where the individual texts are attributed to groups, on the 
basis of these variables. However, most of the variables are redundant due to 
their correlation structure. The stepwise procedures in our analyses resulted 
in at most four relevant predictor variables for the discriminant functions. 
Figure 3 shows the results of the discriminant analysis for all eight text sorts, 
based on four variables: mean word length ($m_1$), variance ($m_2$), coefficient
of variation ($v = s/m_1$), and relative frequency of one-syllable words ($p_1$).
With only 56.30% of all texts being correctly discriminated, some general
tendencies can be observed: (1) although some text sorts are located in clearly
defined areas, there are many overlaps; (2) poems seem to be a separate
category, as well as readers’ letters, open letters, and comments, on the other 
end; (3) drama, short story, private letters and the letters from the epistolary 
novel seem to represent some vaguely defined common area. 
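
To make the idea of a linear discriminant function concrete, here is a deliberately simplified Python sketch for the case of two groups and two predictor variables (Fisher's linear discriminant with a pooled covariance matrix and equal priors). It is not the stepwise procedure used in the analyses reported here; the function names and the data points, given as pairs of mean word length and share of two-syllable words, are hypothetical.

```python
def mean_vector(rows):
    """Column means of a list of (x1, x2) pairs."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def pooled_covariance(group_a, group_b):
    """Pooled within-group covariance matrix (2 x 2) of two groups."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (group_a, group_b):
        mu = mean_vector(rows)
        for r in rows:
            d = [r[0] - mu[0], r[1] - mu[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(group_a) + len(group_b) - 2
    return [[s[i][j] / dof for j in range(2)] for i in range(2)]


def fit_lda(group_a, group_b):
    """Return (w, c): a case x is assigned to group A if w . x > c."""
    mu_a, mu_b = mean_vector(group_a), mean_vector(group_b)
    (a, b), (c_, d) = pooled_covariance(group_a, group_b)
    det = a * d - b * c_
    inv = [[d / det, -b / det], [-c_ / det, a / det]]
    diff = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    # Discriminant direction: inverse pooled covariance times mean difference.
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Equal priors assumed: the threshold lies at the midpoint of the group means.
    mid = [(mu_a[0] + mu_b[0]) / 2, (mu_a[1] + mu_b[1]) / 2]
    return w, w[0] * mid[0] + w[1] * mid[1]


def predict(w, c, x):
    """Assign a case to group 'A' or 'B' by its discriminant score."""
    return "A" if w[0] * x[0] + w[1] * x[1] > c else "B"


# Hypothetical texts, each described by (mean word length, share of 2-syllable words).
private = [(1.85, 0.33), (1.90, 0.30), (2.00, 0.35), (1.80, 0.31), (1.95, 0.36)]
public = [(2.25, 0.24), (2.30, 0.27), (2.40, 0.22), (2.20, 0.25), (2.45, 0.26)]
w, c = fit_lda(private, public)
```

Comparing the predicted with the a priori group of each text yields the cross-tabulations and the percentages of correctly discriminated texts discussed in what follows.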

Fig. 3. Discriminant analysis: Eight text sorts



2.3 From four to two letter types 

In a first approach to explore the underlying structure of the textual universe, 
we concentrate on the four letter types, only, since they were all attributed to 
different classes in the post hoc analyses. Treating all of them - i.e., private
letters (PL), open letters (OL), readers’ letters (RL), and letters from an
epistolary novel (EN) - as separate classes, a percentage of 70.40% correctly
discriminated texts is obtained, with only two relevant variables: $m_1$ and $v$.
There is an obvious tendency that private letters (PL) and the letters from 



Table 4. Discriminant analysis: Four letter types (n = 213)

                 Predicted group
Letter type     PL    OL    RL    EN    Total
PL              37     0     2    22       61
OL               0    22     3     4       29
RL               1     9    10    10       30
EN              10     0     3    80       93



the epistolary novel (EN) represent a common category, whereas open letters 
(OL) and readers’ letters (RL) display this tendency to a lesser degree, if
at all. Combining private letters and the letters from the epistolary novel in
one group, thus discriminating three classes of letters, yields a percentage of
86.90% correctly discriminated texts, with only two variables: $m_1$ and $p_2$ (i.e.,
the percentage of two-syllable words). Table 5 shows the results in detail: 98%
of the combined group are correctly discriminated. This is a strong argument 

Table 5. Discriminant analysis: Three letter types (n = 213)

           Predicted group
Group      1      2      3      Total
1          151      0      3      154
2            2     20      6       28
3           12      5     14       31

1 = {PL, EN}   2 = OL   3 = RL



in favor of the assumption that we are concerned with some common group of 
private letters, be they literary or not. This result casts serious doubt on the
possibility of distinguishing fictional literary letters: obviously, they reproduce
or “imitate” the linguistic style of private letters, which generally calls into
question the functional style of prosaic literature. Given this observation, it
seems reasonable to combine readers’ letters (RL) and open letters (OL) in
one common group, too, and to juxtapose this group of public letters to the
group of private letters. In fact, this results in a high percentage of 92.00%
correctly discriminated texts, with $m_1$ and $p_2$ being the relevant variables.

2.4 Towards a new typology 

On the basis of these findings, the question arises if the two major groups - 
private letters (PL/EN) and public letters (OL/RL) - are a special case
of more general categories, such as, e.g., ‘private/everyday style’ and ‘pub- 
lic/official style’. If this assumption should be confirmed, the re-introduction 
of previously eliminated text sorts should yield positive results. 

The re-introduction of journalistic comments (CO) to the group of 
public texts does not, in fact, result in a decrease of the good discrimination 
result: as Table 6 shows, 91.10% of the 248 texts are correctly discriminated 
(again, with $m_1$ and $p_2$ only). Obviously, some distinction along the line of
public/official vs. private/everyday texts seems to be relevant. 



Table 6. Discriminant analysis: Five text sorts in two categories: Public/Official
vs. Private/Everyday (n = 248)

           Predicted group
Group      1      2      Total
1          148      6      154
2           16     78       94

1 = {PL, EN}   2 = {OL, RL, CO}

The re-introduction of the dramatic texts (DR), as well, seems to be
a logical consequence, regarding them as the literary counterpart of everyday
dialogue. We thus have 290 texts, originating from six different text sorts,
and grouped in two major classes; as Table 7 shows, 92.40% of the texts
are correctly discriminated. One might object, now, that the consideration
of only two classes is likely to be effective. Yet, it is a remarkable result that
the addition of two non-letter text sorts does not result in a decrease of the
previous result.



Table 7. Discriminant analysis: Six text sorts in two categories: Public/Official vs.
Private/Everyday (n = 290)

           Predicted group
Group      1      2      Total
1          190      6      196
2           16     78       94

1 = {PL, EN, DR}   2 = {OL, RL, CO}



The re-introduction of the poetic texts (PO) as a category in its own 
right, results in three text classes. Interestingly enough, under these circum- 
stances, too, the result is not worse: rather, a percentage of 91.20% correct 
discriminations is obtained on the basis of only three variables: $m_1$, $p_2$, $v$. The
results are represented in detail, in Table 8. 



Table 8. Discriminant analysis: Seven text sorts in three categories: Public/Official
vs. Private/Everyday vs. Poetry (n = 330)

           Predicted group
Group      1      2      3      Total
1          191      3      2      196
2           19     75      0       94
3            5      0     35       40

1 = {PL, EN, DR}   2 = {OL, RL, CO}   3 = {PO}



It can clearly be seen that the poetic texts represent a separate category 
and imply almost no mis-classifications. At this point, the obvious question 
arises if a new typology might be the result of our quantitative classification. 
With this perspective in mind, it should be noticed that seven of our eight 
text sorts are analyzed in Table 8.

The re-introduction of the literary prose texts (LP) is the last step,
thus again arriving at the initial number of eight text sorts. As can be seen 
from Table 9, the percentage of correctly discriminated texts now decreases 
to 79.90%. 

Table 9. Discriminant analysis: Eight text sorts in four categories (n = 398)

           Predicted group
Group      1      2      3      4      Total
1          183      3      9      1      196
2           19     75      0      0       94
3           42      0     26      0       68
4            1      0      5     34       40

1 = {PL, EN, DR}   2 = {OL, RL, CO}   3 = {LP}   4 = {PO}



A closer analysis shows that most misclassifications occur between
literary texts and private letters. Interestingly enough, many of these texts
are from one and the same author (Ivan Cankar). One might therefore suspect
authorship to be an important factor; however, Kelih et al. (this volume) have
good arguments (and convincing empirical evidence) that word length is less
dependent on authorship than it is on genre. As an alternative interpretation,
the reason may well be specific to the analyzed material, because in the case
of the literary texts, we are concerned with short stories which aim at the
imitation of orality, and include dialogues to varying degrees.

Therefore, including the literary prose texts (LP) in the group of inoffi- 
cial/oral texts, and separating them from the official/written group, on the 
one hand, and the poetry group, on the other, results in a percentage of 
92.70% correctly discriminated texts, as can be seen from Table 10. The final 
outcome of our classification is represented in Figure 4. 



Table 10. Discriminant analysis: Eight text sorts in three categories: Inofficial /
Oral vs. Official / Written vs. Poetry (n = 398)

           Predicted group
Group      1      2      3      Total
1          260      3      1      264
2           19     75      0       94
3            6      0     34       40

1 = {PL, EN, DR, LP}   2 = {OL, RL, CO}   3 = {PO}
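
The percentages of correctly discriminated texts quoted in this section can be recomputed from the cross-tabulations as the share of texts on the diagonal. The short Python sketch below does this for the counts of Tables 9 and 10 (rows: a priori group, columns: predicted group); the function and variable names are chosen here for illustration.

```python
def percent_correct(confusion):
    """Share of cases on the diagonal of a square cross-tabulation, in percent."""
    total = sum(sum(row) for row in confusion)
    hits = sum(confusion[i][i] for i in range(len(confusion)))
    return 100.0 * hits / total


# Counts from Table 9 (four categories) and Table 10 (three categories).
table_9 = [[183, 3, 9, 1], [19, 75, 0, 0], [42, 0, 26, 0], [1, 0, 5, 34]]
table_10 = [[260, 3, 1], [19, 75, 0], [6, 0, 34]]

print(round(percent_correct(table_9), 1))   # 79.9
print(round(percent_correct(table_10), 1))  # 92.7
```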

Fig. 4. Discriminant analysis: Final results and new categorization



2.5 Conclusion 

The results suggest the existence of specific discourse types, which do not 
coincide with traditional functional styles. Future research must concentrate 
on possible additional discourse types and their relation to text sorts. 



References 

ADAMCZIK, Kirsten (1995): Textsorten - Texttypologie. Eine kommentierte Bibliographie.
Nodus, Münster.

ANTIC, G., KELIH, E., and GRZYBEK, P. (2004): Zero-syllable Words in Determining
Word Length. In: P. Grzybek (Ed.): Contributions to the Science of
Language. Word Length Studies and Related Issues. [In print]

GRZYBEK, P. (2004): History and Methodology of Word Length Studies: The State
of the Art. In: P. Grzybek (Ed.): Contributions to the Science of Language:
Word Length Studies and Related Issues. [In print]

GRZYBEK, P. and KELIH, E. (2004): Texttypologie in/aus empirischer Sicht. In:
J. Bernard, P. Grzybek and Ju. Fikfak (Eds.): Text and Reality. Ljubljana. [In
print]

GRZYBEK, P. and STADLOBER, E. (2003): Zur Prosa Karel Capeks - Einige
quantitative Bemerkungen. In: S. Kempgen, U. Schweier and T. Berger (Eds.):
Rusistika - Slavistika - Lingvistika. Festschrift für Werner Lehfeldt zum 60.
Geburtstag. Sagner, München, 474-488.

KELIH, E., ANTIC, G., GRZYBEK, P., and STADLOBER, E. (2004): Classification
of Author and/or Genre? [Cf. this volume]

KOHLER, R. (1986): Zur synergetischen Linguistik: Struktur und Dynamik der
Lexik. Brockmeyer, Bochum.

ORLOV, Ju.K. (1982): Linguostatistik: Aufstellung von Sprachnormen oder
Analyse des Redeprozesses? (Die Antinomie “Sprache-Rede” in der statistischen
Linguistik). In: Ju.K. Orlov and M.G. Boroda, I.S. Nadaresvili: Sprache,
Text, Kunst. Quantitative Analysen. Brockmeyer, Bochum.



Cluster Ensembles 



Kurt Hornik 

Institut für Statistik und Mathematik,
Wirtschaftsuniversität Wien,
Augasse 2-6, A-1090 Wien, Austria



Abstract. Cluster ensembles are collections of individual solutions to a given clus- 
tering problem which are useful or necessary to consider in a wide range of appli- 
cations. Aggregating these to a “common” solution amounts to finding a consensus 
clustering, which can be characterized in a general optimization framework. We dis- 
cuss recent conceptual and computational advances in this area, and indicate how 
these can be used for analyzing the structure in cluster ensembles by clustering its 
elements. 



1 Introduction 

Ensemble methods create solutions to learning problems by constructing a set 
of individual (different) solutions (“base learners”), and subsequently suitably
aggregating these, e.g., by weighted averaging of the predictions in regression, 
or by taking a weighted vote on the predictions in classification. Such meth- 
ods, which include Bayesian model averaging (Hoeting et al. (1999)), bagging 
(Breiman (1996)) and boosting (Friedman et al. (2000)) have already become 
very popular for supervised learning problems (Dietterich (2002)). 

In general, aggregation yields algorithms with “low variance” in the sta- 
tistical learning sense so that the results obtained by aggregation are more 
“structurally stable”. Based on the success and popularity of ensemble methods,
the statistical and machine learning communities have recently also become
interested in employing these in unsupervised learning tasks, such as
clustering. (Note that in these communities, the term “classification” is used 
for discriminant analysis. To avoid ambiguities, we will use “supervised classi- 
fication” to refer to these learning problems.) For example, a promising idea is 
to obtain more stable partitions of a given data set using bagging (Bootstrap 
Aggregating), i.e., by training the same base clusterer on bootstrap samples 
from the data set and then finding a “majority decision” from the labelings 
thus obtained. But obviously, aggregation is not as straightforward as in the 
supervised classification framework, as these labelings are only unique up 
to permutations and therefore not necessarily matched. In the classification 
community, such aggregation problems have been studied for quite some time 
now. A special issue of the Journal of Classification was devoted to “Compar- 
ison and Consensus of Classifications” (Day (1986)) almost two decades ago. 
By building on the readily available optimization framework for obtaining
consensus clusterings, it is possible to exploit the full potential of the cluster
ensemble approach. 

Employing cluster ensembles can be attractive or even necessary for sev- 
eral reasons, the main ones being as follows (see e.g. Strehl and Ghosh (2002)): 

• To improve quality and robustness of the results. Bagging is one ap- 
proach to reduce variability via resampling or reweighting of the data, 
and is used in Leisch (1999) and Dudoit and Fridlyand (2002). In addi- 
tion, many clustering algorithms are sensitive to random initializations, 
choice of hyper-parameters, or the order of data presentation in on-line 
learning scenarios. An obvious idea for possibly eliminating such algo- 
rithmic variability is to construct an ensemble with (randomly) varied 
characteristics of the base algorithm. This idea of “sampling from the 
algorithm” is used in Dimitriadou et al. (2001, 2002). Aggregation can 
also leverage performance in the sense of turning weak into strong learn- 
ers; both Leisch (1999) and Dimitriadou et al. (2002) illustrate how e.g. 
suitable aggregation of base k-means results can reveal underlying non-convex
structure which cannot be found by the base algorithm. Other
possible strategies include varying the “features” used for clustering (e.g., 
using various preprocessing schemes), and constructing “meta-clusterers”
which combine the results of the application of different base algorithms 
as an attempt to reduce dependency of results on specific methods, and 
take advantage of today’s overwhelming method pluralism. 

• To aggregate results over conditioning/grouping variables in situations 
where repeated measurements of features on objects are available for 
several levels of a grouping variable, such as the 3-way layout in Vichi 
(1999) where the grouping levels correspond to different time points at 
which observations are made. 

• To reuse existing knowledge. In applications, it may be desired to reuse 
legacy clusterings in order to improve or combine these. Typically, in 
such situations only the cluster labels are available, but not the original 
features or algorithms. 

• To accommodate the needs of distributed computing. In many applica- 
tions, it is not possible to use all data simultaneously. Data may not nec- 
essarily be available in a single location, or computational resources may 
be insufficient to use a base clusterer on the whole data set. More generally,
clusterers can have access to either a subset of the objects (“object-distributed
clustering”) or the features (“feature-distributed clustering”),
or both. 

In all these situations, aggregating (subsets of) the cluster ensemble by 
finding “good” consensus clusterings is fundamental. In Section 2, we consider 
a general optimization framework for finding consensus partitions. Extensions 
are discussed in Section 3.

2 Consensus partitions

There are three main approaches to obtaining consensus clusterings (Gor- 
don and Vichi (2001)): in the constructive approach, a way of constructing a 
consensus clustering is specified: for example, a strict consensus clustering is 
defined to be one such that objects can only be in the same group in the con- 
sensus partition if they were in the same group in all base partitions. In the 
axiomatic approach, emphasis is on the investigation of existence and unique- 
ness of consensus clusterings characterized axiomatically. The optimization 
approach formalizes the natural idea of describing consensus clusterings as 
the ones which “optimally represent the ensemble” by providing a criterion 
to be optimized over a suitable set $\mathcal{C}$ of possible consensus clusterings. Given
a function $d$ which measures dissimilarity (or distance) between two clusterings,
one can e.g. look for clusterings which minimize average dissimilarity,
i.e., which solve

$$C^* = \operatorname{argmin}_{C \in \mathcal{C}} \sum_{b=1}^{B} d(C, C_b)$$

over $\mathcal{C}$. Analogously, given a measure of similarity (or agreement), one can
look for clusterings maximizing average similarity. Following Gordon and
Vichi (1998), one could refer to the above $C^*$ as the median or medoid clustering
if the optimum is sought over the set of all possible base clusterings,
or the set $\{C_1, \ldots, C_B\}$ of the base clusterings, respectively.
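
The medoid case is the easiest one to illustrate, since the search is restricted to the base clusterings themselves. The following Python sketch picks the ensemble member with the smallest summed dissimilarity to all members. The dissimilarity used here, the fraction of object pairs on which two hard partitions disagree, is only one possible choice of $d$, and the function names and the toy ensemble are hypothetical.

```python
from itertools import combinations


def pair_disagreement(labels_a, labels_b):
    """Fraction of object pairs that one partition joins and the other separates."""
    pairs = list(combinations(range(len(labels_a)), 2))
    differ = sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return differ / len(pairs)


def medoid(ensemble, d=pair_disagreement):
    """Base clustering minimizing the summed dissimilarity to all ensemble members."""
    return min(ensemble, key=lambda c: sum(d(c, other) for other in ensemble))


# Toy ensemble of four hard partitions of six objects; the first two differ
# only by a permutation of the labels and hence have dissimilarity zero.
ensemble = [
    [0, 0, 1, 1, 2, 2],
    [1, 1, 0, 0, 2, 2],
    [0, 0, 0, 1, 2, 2],
    [0, 1, 1, 1, 2, 2],
]
consensus = medoid(ensemble)
```

Because the dissimilarity compares pairs of objects rather than labels, no matching of labels between partitions is needed in this particular sketch.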

When finding consensus partitions, it seems natural to look for optimal
soft partitions which make it possible to assign objects to several groups with 
varying degrees of “membership” (Gordon and Vichi (2001), Dimitriadou et 
al. (2002)). One can then assess the amount of belongingness of objects to 
groups via standard impurity measures, or the so-called classification mar- 
gin (the difference between the two largest memberships). Note that “soft” 
partitioning includes fuzzy partitioning methods such as the popular fuzzy 
c-means algorithm (Bezdek (1974)) as well as probabilistic methods such as 
the model-based approach of Fraley and Raftery (2002). In addition, one can 
compute global measures $\Phi$ of the softness of partitions, and use these to
extend the above optimization problem to minimizing

$$\sum_{b=1}^{B} \omega_b \, d(C, C_b) + \lambda \, \Phi(C)$$

over all soft partitions, where the $\omega_b$ indicate the importance of the base
clusterings (e.g., by assigning importance according to softness of the base
partitions), and $\lambda$ controls the amount of “regularization”. This extension also
allows for a soft-constrained approach to the “simple” problem of optimizing
over all hard partitions. Of course, one could consider criterion functions
resulting in yet more robust consensus solutions, such as the median or trimmed
mean of the distances $d(C, C_b)$.
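
As a small illustration of the classification margin mentioned above, the following Python sketch computes it for single membership vectors and averages it over a soft partition. Using the mean margin as a global indicator of how "hard" a partition is, as well as the function names, are assumptions made here for illustration and not a measure prescribed in the text.

```python
def classification_margin(memberships):
    """Difference between the two largest memberships of one object (needs K >= 2)."""
    top, second = sorted(memberships, reverse=True)[:2]
    return top - second


def mean_margin(membership_matrix):
    """Average margin over all objects; 1.0 for a hard partition."""
    margins = [classification_margin(row) for row in membership_matrix]
    return sum(margins) / len(margins)


# Hypothetical soft partition of three objects into three groups.
soft = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2], [1.0, 0.0, 0.0]]
margins = [classification_margin(row) for row in soft]
```

An object with margin close to zero is ambiguous between its two most likely groups, whereas a margin of one corresponds to a hard assignment.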

One should note that the above optimization problems are typically computationally
very hard. Finding an optimal hard partition with $K$ labels in
general makes it necessary to search all possible hard partitions (the number
of which is of the order $(K + 1)^n$ (Jain and Dubes (1988))) for the optimum.
Such exhaustive search is clearly impossible for most applications.
Local strategies, e.g. by repeating random reassigning until no further im- 
provement is obtained, or Boltzmann-machine type extensions (Strehl and 
Ghosh (2002)) are still expensive and not guaranteed to find the global opti- 
mum. 

Perhaps the most popular similarity measure for partitions of the same 
data set is the Rand index (Rand (1971)) used in e.g. Gordon and Vichi 
(1998), or the Rand index corrected for agreement by chance (Hubert and 
Arabie (1985)) employed by Krieger and Green (1999). Finding (hard) con- 
sensus partitions by maximizing average similarity is NP-hard in both cases. 
Hence, Krieger and Green (1999) propose an algorithm (SEGWAY) based 
on the combination of local search by relabeling single objects together with 
“smart” initialization using random assignment, latent class analysis (LCA), 
multiple correspondence analysis (MCA), or a greedy heuristic. Note also 
that using (dis)similarity measures adjusted for agreement by chance works
best if the partitions are stochastically independent, which is not necessarily 
the case in all cluster ensemble frameworks described in Section 1. 
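
For reference, the two similarity measures can be computed from pair counts as in the following Python sketch, which returns the Rand index together with its chance-corrected version in the form commonly attributed to Hubert and Arabie (1985). The function name is chosen here for illustration, and the sketch assumes hard partitions given as label lists; it does not guard against the degenerate case in which the correction's denominator is zero (e.g., two one-group partitions).

```python
from collections import Counter
from math import comb


def rand_indices(labels_a, labels_b):
    """Return (Rand index, Rand index corrected for chance) of two hard partitions."""
    total = comb(len(labels_a), 2)
    # Pairs joined in both partitions, from the cross-tabulation of the labels.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Agreements are pairs joined in both plus pairs separated in both.
    rand = (total + 2 * both - in_a - in_b) / total
    expected = in_a * in_b / total
    max_index = (in_a + in_b) / 2
    adjusted = (both - expected) / (max_index - expected)
    return rand, adjusted


# Identical partitions with permuted labels agree perfectly.
same = rand_indices([0, 0, 1, 1], [1, 1, 0, 0])
# Two different partitions of six objects.
mixed = rand_indices([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
```

Both measures are invariant under relabeling, so no label matching is required before comparing two partitions.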

In what follows, the following terminology will be useful. Given a data set
$X$ with the measurements of the same features (variables) on $n$ objects, a
$K$-clustering of $X$ assigns to each $x_i$ in $X$ a (sub-)probability
$K$-vector $C(x_i) = (\mu_{i1}, \ldots, \mu_{iK})$ (the “membership vector” of
the object) with $\mu_{i1}, \ldots, \mu_{iK} \ge 0$, $\sum_k \mu_{ik} \le 1$.
Formally,

$$C : X \to M \in \mathcal{M}_K; \qquad \mathcal{M}_K = \{ M \in \mathbb{R}^{n \times K} : M \ge 0,\ M 1_K \le 1_n \},$$

where $1_K$ is a length $K$ column vector of ones (and $1_n$ one of length
$n$), and $M 1_K$ is the matrix product of $M$ and $1_K$. This framework
includes hard partitions (where each $C(x_i)$ is a Cartesian unit vector) and
soft ones, as well as incomplete results (e.g., with memberships completely
missing for some objects, for example if a sample from $X$ was used) where
$\sum_k \mu_{ik} < 1$. Permuting the labels (which correspond to the columns
of the membership matrix $M$) amounts to replacing $M$ by $M\Pi$, where $\Pi$
is a suitable permutation matrix.

The dissimilarity measure used in Models I and II of Gordon and Vichi (2001)
and in Dimitriadou et al. (2002) is the Euclidean dissimilarity of the
membership matrices, adjusted for optimal matching of the labels. If both
partitions use the same number of labels, this is given by

$$d_F(M, \tilde{M}) = \min_{\Pi} \| M - \tilde{M}\Pi \|^2,$$

where the minimum is taken over all permutation matrices $\Pi$ and
$\|\cdot\|$ is the Frobenius norm (so that $\|Y\|^2 = \mathrm{tr}(Y'Y)$, where
$'$ denotes transposition). As
$\|M - \tilde{M}\Pi\|^2 = \mathrm{tr}(M'M) - 2\,\mathrm{tr}(M'\tilde{M}\Pi) + \mathrm{tr}(\Pi'\tilde{M}'\tilde{M}\Pi) = \mathrm{tr}(M'M) - 2\,\mathrm{tr}(M'\tilde{M}\Pi) + \mathrm{tr}(\tilde{M}'\tilde{M})$,
we see that minimizing $\|M - \tilde{M}\Pi\|^2$ is equivalent to maximizing
$\mathrm{tr}(M'\tilde{M}\Pi) = \sum_k \sum_i \mu_{ik} \tilde{\mu}_{i,\pi(k)}$
(with $\pi(k)$ the column of $\tilde{M}$ that $\Pi$ moves to position $k$),
which for hard partitions is the number of objects with the same label in the
partitions given by $M$ and $\tilde{M}\Pi$. Finding the optimal $\Pi$ is thus
recognized as an instance of the assignment problem (or weighted bipartite
graph matching problem), which can be solved by a linear program using the
so-called Hungarian method in time $O(K^3)$ (e.g., Papadimitriou and Steiglitz
(1982)). If the partitions have different numbers of labels, matching also
includes suitably collapsing the labels of the finer partition, see Gordon and
Vichi (2001) for details.
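The matching step can be illustrated with a minimal sketch. It assumes membership matrices stored as lists of rows and, for brevity, searches all $K!$ relabelings by brute force instead of using the Hungarian method, so it is only suitable for small $K$; the function names are hypothetical.

```python
from itertools import permutations

def frobenius_sq(M, N):
    """Squared Frobenius norm of M - N for equally sized list-of-rows matrices."""
    return sum((m - n) ** 2 for rm, rn in zip(M, N) for m, n in zip(rm, rn))

def relabel(M, perm):
    """Permute columns: column k of the result is column perm[k] of M."""
    return [[row[j] for j in perm] for row in M]

def d_F(M, M_tilde):
    """Return (distance, best permutation) for min over relabelings of M_tilde.

    Brute force over all K! permutations (an assumption made for brevity);
    an assignment solver would replace this search for larger K.
    """
    K = len(M[0])
    best = min(permutations(range(K)),
               key=lambda p: frobenius_sq(M, relabel(M_tilde, p)))
    return frobenius_sq(M, relabel(M_tilde, best)), best

# Two hard partitions of five objects that differ only in the naming of labels
M = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
M_tilde = [[0, 1, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [1, 0, 0]]
dist, perm = d_F(M, M_tilde)
```

In this toy example the two partitions coincide up to relabeling, so the distance is zero once the columns of the second matrix are matched to those of the first.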

Finding the consensus $K$-clustering of given base $K$-clusterings with
membership matrices $M_1, \ldots, M_B$ amounts to minimizing
$\sum_{b=1}^B d_F(M, M_b)$ over $\mathcal{M}_K$ and is equivalent to
minimizing $\sum_{b=1}^B \| M - M_b \Pi_b \|^2$ over $M \in \mathcal{M}_K$
and all permutation matrices $\Pi_1, \ldots, \Pi_B$. Dimitriadou et al. (2002)
show that the optimal $M$ is of the form

$$M = \frac{1}{B} \sum_{b=1}^B M_b \Pi_b$$

for suitable permutation matrices $\Pi_1, \ldots, \Pi_B$. A hard partition
obtained from this consensus partition by assigning objects to the label with
maximal membership thus performs simple majority voting after relabeling,
which motivates the name “voting” for the proposed framework. The
$\Pi_1, \ldots, \Pi_B$ in the above representation are obtained by
simultaneously maximizing the profile criterion function

$$\sum_{1 \le \beta, b \le B} \mathrm{tr}(\Pi_\beta' M_\beta' M_b \Pi_b)$$

over all possible permutation matrices (of course, one of these can be taken as
the identity matrix). This is a special case (but not an instance) of the
multiple assignment problem, which is known to be NP-complete, and can e.g.
be approached using randomized parallel algorithms (Oliveira and Pardalos
(2004)). However, we note that unlike in the general case, the above criterion
function only contains second-order interaction terms of the permutations.
Whether the determination of the optimal permutations and hence of the
consensus clustering is possible in time polynomial in both $B$ and $K$ is
currently not known.

Based on the characterization of the consensus solution, Dimitriadou et
al. (2002) suggest a greedy forward aggregation strategy for determining
approximate solutions. One starts with $\hat{M}_0 = M_1$ and then, for all $b$
from 1 to $B$, first determines a locally optimal relabeling $\hat{\Pi}_b$ of
$M_b$ to $\hat{M}_{b-1}$ (i.e., solves the assignment problem
$\mathrm{argmin}_{\Pi} \|\hat{M}_{b-1} - M_b \Pi\|^2$ using the Hungarian
method), and determines the optimal
$M = \hat{M}_b = (1/b) \sum_{\beta=1}^{b} M_\beta \hat{\Pi}_\beta$ for fixed
$\hat{\Pi}_1, \ldots, \hat{\Pi}_b$ by on-line averaging as
$\hat{M}_b = (1 - 1/b)\hat{M}_{b-1} + (1/b) M_b \hat{\Pi}_b$. The final
$\hat{M}_B$ is then taken as the approximate consensus clustering. One could
extend this approach into a fixed-point algorithm which repeats the forward
aggregation, with the order of membership matrices possibly changed, until
convergence. Gordon and Vichi (2001) propose a different approach which
iterates between simultaneously determining the optimal relabelings
$\Pi_1, \ldots, \Pi_B$ for fixed $M$ by solving the corresponding assignment
problems, and then optimizing for $M$ for fixed $\Pi_1, \ldots, \Pi_B$ by
computing the average $(1/B) \sum_{b=1}^B M_b \Pi_b$.
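The greedy forward aggregation can be sketched as follows. This is an illustration under simplifying assumptions rather than the implementation of Dimitriadou et al.: the relabeling step uses a brute-force search over permutations in place of the Hungarian method, and the helper names are hypothetical.

```python
from itertools import permutations

def best_relabeling(M_ref, M):
    """Columns of M permuted to minimize the squared Frobenius distance to M_ref."""
    def cost(p):
        return sum((r[k] - m[p[k]]) ** 2
                   for r, m in zip(M_ref, M) for k in range(len(p)))
    p = min(permutations(range(len(M[0]))), key=cost)  # brute force, small K only
    return [[row[j] for j in p] for row in M]

def vote(memberships):
    """Greedy forward aggregation: match each M_b to the running mean, then average."""
    consensus = [list(map(float, row)) for row in memberships[0]]
    # b = 1 is the trivial step (M_1 matched to itself), so the loop starts at b = 2
    for b, M_b in enumerate(memberships[1:], start=2):
        matched = best_relabeling(consensus, M_b)
        consensus = [[(1 - 1 / b) * c + (1 / b) * m for c, m in zip(rc, rm)]
                     for rc, rm in zip(consensus, matched)]
    return consensus

def hard_labels(M):
    """Assign each object to the label with maximal membership."""
    return [max(range(len(row)), key=row.__getitem__) for row in M]
```

Applying `hard_labels` to the output of `vote` corresponds to the majority voting after relabeling described in the text.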

In the aggregation strategy Bag1 of Dudoit and Fridlyand (2002), the
same base clusterer is applied to both the original data set and $B$ bootstrap
samples thereof, giving membership matrices $M_{\mathrm{ref}}$ and
$M_1, \ldots, M_B$. Optimal relabelings $\Pi_b$ are obtained by matching the
$M_b$ to $M_{\mathrm{ref}}$, and (a hard version of) the consensus partition
is then obtained by averaging the $M_b \Pi_b$. There seems to be no
optimization criterion underlying this constructive approach.

According to Messatfa (1992), historically the first index of agreement
between partitions is due to Katz and Powell (1953), and based on the Pearson
product moment correlation coefficient of the off-diagonal entries of the
co-incidence matrices $MM'$ of the partitions. (Note that the $(i,j)$-th
element of $MM'$ is given by $\sum_{k=1}^K \mu_{ik}\mu_{jk}$, which in the
case of hard partitions is one if objects $i$ and $j$ are in the same group,
and zero otherwise, and that relabeling does not change $MM'$.) A related
dissimilarity measure (using covariance rather than correlation) is

$$d_C(M, \tilde{M}) = \| MM' - \tilde{M}\tilde{M}' \|^2.$$

The corresponding consensus problem is the minimization of
$\sum_{b=1}^B \| MM' - M_b M_b' \|^2$, or equivalently of

$$\Big\| MM' - \frac{1}{B} \sum_{b=1}^B M_b M_b' \Big\|^2$$

over $\mathcal{M}_K$. This is Model III of Gordon and Vichi (2001), who
suggest using a sequential quadratic programming algorithm (which can only be
guaranteed to find local minima) for obtaining the optimal
$M \in \mathcal{M}_K$. The average co-incidence matrix
$(1/B) \sum_{b=1}^B M_b M_b'$ also forms the basis of the constructive
consensus approaches in Fred and Jain (2002) and Strehl and Ghosh (2002).
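A small sketch may help to see why the co-incidence matrix needs no label matching. It assumes membership matrices stored as lists of rows; the function names are chosen here for illustration only.

```python
def coincidence(M):
    """Return the co-incidence matrix M M' of a list-of-rows membership matrix."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in M] for ri in M]

def d_C(M, M_tilde):
    """Squared Frobenius distance between the two co-incidence matrices."""
    A, B = coincidence(M), coincidence(M_tilde)
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))

M = [[1, 0], [1, 0], [0, 1]]
M_swapped = [[0, 1], [0, 1], [1, 0]]   # same partition, labels exchanged
M_other = [[1, 0], [0, 1], [0, 1]]     # second object moved to the other group
```

Exchanging the labels leaves the co-incidence matrix unchanged, so `d_C(M, M_swapped)` vanishes, whereas moving one object changes four off-diagonal entries.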

3 Extensions 

The optimization approach to finding consensus clusterings is also applicable
to the case of hierarchical clusterings (Vichi (1999)). If these are represented
by the corresponding ultra-metric matrices $U_1, \ldots, U_B$, a consensus
clustering can be obtained e.g. by minimizing $\sum_b \| U - U_b \|^2$ over
all possible ultra-metric matrices $U$.

In many applications of cluster ensembles, interest is not primarily in
obtaining a global consensus clustering, but in analyzing (dis)similarity
patterns in the base clusterings in more detail, i.e., in clustering the
clusterings. Gordon and Vichi (1998) present a framework in which all
clusterings considered are hard partitions. Obviously, the underlying concept
of “clustering clusterings”, based on suitable (dis)similarity measures
between clusterings, such as the ones discussed in detail in Section 2, is
much more general. In particular, it is straightforward to look for hard
prototype-based partitions of a cluster ensemble characterized by the
minimization of



where $e_k$ is the $k$-th Cartesian unit vector, over all possible hard
assignments $C$ of membership matrices to labels and all suitable prototypes
$P_1, \ldots, P_K$. If the usual algorithm which alternates between finding
optimal prototypes for fixed assignments and reassigning the $M_b$ to their
least dissimilar prototype is employed, we see that finding the prototypes
amounts to finding the appropriate consensus partitions in the groups.
Similarly, soft partitions can be characterized as the minima of the fuzzy
c-means style criterion function

Bootstrap Confidence Intervals for Three-way Component Methods

Abstract. The two most common component methods for the analysis of three-way
data, CANDECOMP/PARAFAC (CP) and Tucker3 analysis, are used to summarize a
three-mode three-way data set by means of a number of component matrices, and,
in case of Tucker3, a core array. Until recently, no procedures for computing
confidence intervals for the results from such analyses were available.
Recently, such procedures have become available: Riu and Bro (2003) proposed
one for CP using the jack-knife procedure, and Kiers (2004) for CP and Tucker3
analysis using the bootstrap procedure. The present paper reviews the latter
procedures, discusses their performance as reported by Kiers (2004), and
illustrates them on an example data set.

1 Introduction 

For the analysis of three-way data sets (e.g., data with scores of a number 
of subjects, on a number of variables, under a number of conditions) various 
exploratory three-way methods are available. The two most common methods 
for the analysis of three-way data are CANDECOMP/PARAFAC (Carroll 
and Chang (1970), Harshman (1970)) and Tucker3 analysis (Tucker (1966), 
Kroonenberg and De Leeuw (1980)). Both methods summarize the data by 
components for all three modes, and for the entities pertaining to each mode 
they yield component loadings; in the case of Tucker3 analysis, in addition, 
a so-called core array is given, which relates the components for all three 
modes to each other. 

If we denote our $I \times J \times K$ three-way data array (which has usually
been preprocessed by centering and/or scaling procedures) by X, then the two
methods can be described as fitting the model

$$x_{ijk} = \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R} a_{ip} b_{jq} c_{kr} g_{pqr} + e_{ijk}, \qquad (1)$$

where $a_{ip}$, $b_{jq}$, and $c_{kr}$ denote elements of the component
matrices A (for the first mode, e.g., the subjects), B (for the second mode,
e.g., the variables), and C (for the third mode, e.g., the conditions), of
orders $I \times P$, $J \times Q$, and $K \times R$, respectively; $g_{pqr}$
denotes the element $(p, q, r)$ of the $P \times Q \times R$ core array G, and
$e_{ijk}$ denotes the error term for element $x_{ijk}$; $P$, $Q$, and $R$
denote the numbers of components for the three respective modes. Once the
solution has been obtained, component matrices and/or the core are usually
rotated to simplify the interpretation (see Kiers (1998)), without loss of fit.
CANDECOMP/PARAFAC (CP) differs from Tucker3 analysis in that in CP the core is
set equal to a superidentity array (i.e., with $g_{pqr} = 1$ if $p = q = r$,
and $g_{pqr} = 0$ otherwise). As a consequence, in the case of CP, for all
modes we have the same number of components, and (1) actually reduces to

$$x_{ijk} = \sum_{r=1}^{R} a_{ir} b_{jr} c_{kr} + e_{ijk}.$$

Because the CP model is unique up to scaling and permutation, no rotations
can be used to simplify the interpretation.

e-mail: h.a.l.kiers@ppsw.rug.nl

Both models are fitted to sample data by minimizing the sum of squared errors.
Usually, the model that fits optimally to the sample is assumed to be, at
least to some extent, also valid for the population from which the sample was
drawn. However, until recently, only global measures for indicating the
reliability of such generalizations from sample to population were available,
for instance by using cross-validation (see, e.g., Kiers and van Mechelen
(2001)). Recently, however, resampling procedures (see, e.g., Efron and
Tibshirani (1993)) have been proposed for obtaining confidence intervals for
all generalizable individual parameters resulting from a three-way component
analysis. For this purpose, Riu and Bro (2003) proposed a jack-knife procedure
for CP, and Kiers (2004) proposed various bootstrap procedures for CP and
Tucker3 analysis. Both procedures can be used when the entities in one of the
three modes can be considered a random sample from a population. According to
Efron and Tibshirani (1993), in general, the bootstrap can be expected to be
more efficient than the jack-knife, so here we will focus on the bootstrap
rather than the jack-knife.

Bootstrap analysis can be applied straightforwardly when solutions are uni- 
quely determined. However, the Tucker3 solution is by no means uniquely 
determined. Kiers (2004) described various procedures for handling this non- 
uniqueness in case of Tucker3. He also studied their performance in terms 
of coverage of the resulting confidence intervals, and in terms of computa- 
tional efficiency by means of a simulation study. The main purpose of the 
present paper is to review these procedures briefly, and to describe how such 
a procedure works in practice in case of the analysis of an empirical data set. 

2 The bootstrap for fully determined solutions 

The basic idea of the bootstrap is to mimic the sampling process that
generated our actual data sample, as follows. We suppose that the entities in
the first mode (e.g., the individuals) are a random sample from a population.
Then, with the bootstrap procedure we assess what could happen if we
considered our sample as a population, and randomly (re)sampled with
replacement from this 'pseudo-population'. In fact, we consider the
distribution of score profiles in our actual sample as a proxy of the
distribution of such profiles in our population; by randomly resampling from
our sample, we mimic the construction of a sampling distribution, on the basis
of which we intend to make inferences about actual population characteristics.
In practice, if our three-way data set of order $I \times J \times K$ is
denoted by X, and the score profiles for individual $i$ are denoted by $X_i$,
which is a matrix of order $J \times K$, then we randomly draw (with
replacement) $I$ matrices $X_i$ from the set of matrices
$\{X_1, \ldots, X_I\}$. This creates one bootstrap sample (in which some
matrices may occur repeatedly, and others not at all), which is then
reorganized into a three-way array. This procedure is to be repeated in order
to obtain, for instance, 500 such bootstrap three-way arrays. (In the sequel,
500 is taken as the number of bootstrap samples, but this is just meant as an
example; a higher number will always be better, although the improvement
may be small.)
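The resampling step can be sketched in a few lines. The sketch assumes the three-way array is stored as a list of $I$ first-mode slices (each a $J \times K$ matrix); the function name and the seed argument are illustrative choices, not part of the procedure described by Kiers (2004).

```python
import random

def bootstrap_arrays(X, n_boot=500, seed=0):
    """Yield n_boot resampled three-way arrays.

    X is a list of I slices (one J x K matrix per individual); each bootstrap
    array again consists of I slices, drawn with replacement from X.
    """
    rng = random.Random(seed)
    I = len(X)
    for _ in range(n_boot):
        # draw I first-mode indices with replacement, keeping each slice whole
        yield [X[rng.randrange(I)] for _ in range(I)]
```

Each generated array would then be analyzed in exactly the same way as the original sample.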

Now to each bootstrap three-way array, we apply a three-way component 
method in exactly the same way as we applied it originally to our sample 
(hence including the pre- and postprocessing procedures we used in the analy- 
sis of the sample data), and we compute the statistics we are interested in. 
The statistics of main interest are the loadings and the core values. Let these
be collected in a single vector $\theta$. The vector of outcomes for our
original sample is denoted as $\theta^s$, whereas those for the bootstrap
samples are denoted as $\theta^b$, $b = 1, \ldots, 500$. Now, the variation in
the 500 bootstrap sample outcome vectors indicates how and how much the
outcome vectors vary if we randomly resample from our pseudo-population. This
is used as an estimate of how much real samples from our real population can
be expected to vary if we sampled repeatedly from our actual population.

A simple way to describe the variation across the bootstrap sample outcomes
is to give, for each parameter separately, a percentile interval (e.g., a 95%
percentile interval) which describes the range in which we find the middle 95%
of the values (out of the total of 500 values) of the parameter at hand. In
this way, for each loading and each core value we get a 95% percentile
interval. Such percentile intervals can be interpreted as approximate
confidence intervals.
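A percentile interval for a single parameter can be computed as in the sketch below. Several conventions exist for choosing the cut-off ranks; the one used here (cutting off the same whole number of values in each tail) is an assumption made for illustration.

```python
def percentile_interval(values, level=0.95):
    """Return (lower, upper) bounds enclosing the central `level` share of values."""
    v = sorted(values)
    n = len(v)
    tail = (1 - level) / 2
    # number of values cut off in each tail; the small offset guards against
    # floating-point results such as 4.999999999999999 instead of 5
    lo = int(tail * n + 1e-9)
    hi = n - 1 - lo
    return v[lo], v[hi]
```

With 500 bootstrap values and a 95% level, this drops the 12 smallest and the 12 largest values and reports the range of the remaining ones.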
The above procedure was based on computing the loadings and core values
in exactly the same way for each bootstrap three-way array. However, this
requires that the models are completely identified, in some way or another.
Identification of CP or Tucker3 solutions can be done as follows. One of the
key features of the CP model is that it is 'essentially' uniquely identified
(see Carroll and Chang (1970), Harshman (1970)). By this it is meant that the
component matrices A, B, and C resulting from a CP analysis are, under
mild assumptions, unique up to a joint permutation of the columns of the
three matrices, and up to scaling of the columns of the three matrices. Hence
a simple procedure to further identify this solution is to scale components
such that the component matrices for two modes (e.g., the first two modes)
have unit column sums of squares. A procedure to fix the order of the
components is by ordering them such that the column sums of squares decrease.
Then it still remains to identify the sign of the component matrices. This can
be done in various ways, that are, however, all rather arbitrary (e.g., ensure
that the column sums in the component matrices A and B are positive).
The Tucker3 model is not at all uniquely identified: as already shown by
Tucker (1966), the fit is not affected by arbitrarily multiplying each of the
component matrices by a nonsingular square matrix, provided that the core is
multiplied appropriately by the inverse of these transformations. Specifically,
postmultiplication of the component matrices A, B, and C by nonsingular
matrices S, T, and U, respectively, does not affect the model estimates if
the core array is multiplied in the appropriate way by $S^{-1}$, $T^{-1}$, and
$U^{-1}$, respectively.
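The invariance can be checked directly in the elementwise notation of the Tucker3 model. The following lines are a sketch; the tilde quantities and summation indices are introduced here only for illustration.

```latex
% Transformed component matrices and counter-transformed core (notation assumed here)
\tilde a_{ip} = \sum_{s} a_{is} S_{sp}, \quad
\tilde b_{jq} = \sum_{t} b_{jt} T_{tq}, \quad
\tilde c_{kr} = \sum_{u} c_{ku} U_{ur}, \quad
\tilde g_{pqr} = \sum_{p',q',r'} (S^{-1})_{pp'} (T^{-1})_{qq'} (U^{-1})_{rr'}\, g_{p'q'r'}

% Since \sum_p S_{sp} (S^{-1})_{pp'} = \delta_{sp'} (and likewise for T and U),
% the structural part of the model is unchanged:
\sum_{p,q,r} \tilde a_{ip} \tilde b_{jq} \tilde c_{kr} \tilde g_{pqr}
  = \sum_{p',q',r'} a_{ip'} b_{jq'} c_{kr'} g_{p'q'r'}
```

Hence the fitted values, and with them the sum of squared errors, are the same for the original and the transformed solution.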

To identify the Tucker3 solution, a first commonly used step is to require
the component matrices to be columnwise orthonormal, which reduces the
transformational nonuniqueness to rotational nonuniqueness. A requirement
to further identify the solution is to rotate the component matrices to what
Kiers (2004) called the “principal axes” orientation of the Tucker3 solution.
This identifies the rotation of all component matrices, as well as their
permutation.

The principal axes solution has nice theoretical properties, but usually is not 
easy to interpret. Alternatively, identification can be obtained by some simple 
structure rotation of the core and component matrices (Kiers (1998)). Such 
rotations identify the Tucker3 solution up to permutation and scaling. We 
thus end up in the same situation as with CP, and can hence use the same 
procedure to obtain full identification (see above). 

Above it has been shown how the CP solution and the Tucker3 solution can 
be identified completely. If we would use exactly the same identification pro- 
cedure for all bootstrap solutions, then we can compare bootstrap solutions, 
and sensibly compute percentile intervals, and use these as estimates of con- 
fidence intervals for our parameters. However, in doing so, we imply that in 
our actual data analysis, we consider as our solution only the one that we get 
from exactly the same identification procedure. As a consequence, if we would 
have two samples from the same population, and we analyze both in exactly 
the same way, and it so happens that the solutions are almost identical but 
have a different ordering of the columns or (in case of Tucker3 analysis) a 
different rotation of the component matrix at hand, then we would not recog- 
nize this near identity of the solutions (as is illustrated by Kiers (2004)). To 
avoid overlooking such near similarities, we should not take the identifications 
used for obtaining the bootstrap solutions too seriously, and we should use
procedures that consider bootstrap solutions similar when they only differ by
permutations or (in case of Tucker3) other nonsingular transformations. 

3 Smaller bootstrap intervals using transformations 

When, in computing percentile intervals, we wish to consider bootstrap
solutions as similar when they differ only by a permutation and scaling, this
can be taken into account as follows. Before comparing bootstrap solutions, the
components are all reflected and permuted (as far as is possible without
affecting the fit) in such a way that they optimally resemble the sample
solution. In case of Tucker3, also the core should be appropriately reflected
and rescaled. For details, the reader is referred to Riu and Bro (2003) or
Kiers (2004), who offer two slightly different procedures for achieving this.
As an obvious consequence, the bootstrap solutions will then also resemble
each other well. For each loading and core value, the associated 500 values in
the resulting permuted and reflected bootstrap solutions can then be used to
set up a 95% percentile interval. Typically, these intervals will be smaller
than the ones based on fully identified solutions. If orderings and scalings
are not to be taken seriously (as is often the case in practice), this is
indeed desirable, because then the fully identified solutions would lead to
artificially wide intervals, as bootstrap solutions that differ mainly in
irrelevant ways (i.e., by permutations and/or scalings) would nevertheless be
considered as strongly different.
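The reflect-and-permute step can be illustrated for a single component matrix. The sketch below is a simplification and not the procedure of Riu and Bro (2003) or Kiers (2004): it matches columns by the absolute value of their inner products, searches all permutations by brute force, and uses hypothetical function names. In a full CP solution the same permutation would have to be applied to all three component matrices, and each sign flip compensated in another mode so that the fit is unaffected.

```python
from itertools import permutations

def column(M, j):
    """Column j of a list-of-rows matrix."""
    return [row[j] for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def match_columns(ref, boot):
    """Permutation and signs making the columns of `boot` resemble those of `ref`."""
    R = len(ref[0])
    # agreement of reference column r with bootstrap column s
    sim = [[dot(column(ref, r), column(boot, s)) for s in range(R)]
           for r in range(R)]
    perm = max(permutations(range(R)),
               key=lambda p: sum(abs(sim[r][p[r]]) for r in range(R)))
    signs = [1 if sim[r][perm[r]] >= 0 else -1 for r in range(R)]
    matched = [[signs[r] * row[perm[r]] for r in range(R)] for row in boot]
    return perm, signs, matched
```

If the bootstrap matrix equals the sample matrix with its columns exchanged and reflected, the matched matrix reproduces the sample matrix exactly.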

As mentioned above, fully identified Tucker3 bootstrap solutions may differ 
considerably even when fit preserving transformations exist that make them 
almost equal. For example, simple structure rotation may lead to very dif- 
ferent solutions for two bootstrap samples, when for the original sample two 
rather different rotations will yield almost the same degree of simplicity; in 
such cases solutions for some bootstrap samples may, after rotation, resem- 
ble one rotated sample solution, while others may resemble the other rotated 
sample solution, even when, before rotation, both would resemble the orig- 
inal unrotated sample solution very much. Often, the optimal simplicity of 
a solution, as such, is not taken seriously (similarly as the actual ordering 
with respect to sums of squares is usually not taken seriously). Then, it is 
appropriate to consider as similar all bootstrap solutions that are similar af- 
ter an optimal transformation towards each other, or to a reference solution. 
This idea has repeatedly been used in bootstrap or jack-knife procedures 
for two-way analysis techniques (Meulman and Heiser (1983), Krzanowski 
(1987), Markus (1994), Milan and Whittaker (1995), Groenen, Commandeur 
and Meulman (1998)). These two-way techniques cannot as such be used in
the three-way situation. For Tucker3, Kiers (2004) proposed two procedures
to make three-way bootstrap solutions optimally similar to the sample
solution. The first uses 'only' rotational freedom, leaving intact the
columnwise orthonormality of the component matrices; the other uses the full
transformational freedom in the Tucker3 model. The basic idea is as follows.

Let a Tucker3 solution be given by A, B, C and G, and bootstrap solutions
be indicated by $A^b$, $B^b$, $C^b$ and $G^b$. As in the usual solutions, the
component matrices are columnwise orthonormal. Now we want to transform $B^b$,
$C^b$ and $G^b$ such that they become optimally similar to B, C and G,
respectively. Thus, we first search (possibly orthonormal) transformation
matrices T and U, such that $B^b T$ and $C^b U$ become optimally similar to B
and C, respectively. For this purpose, we minimize
$f_1(T) = \| B^b T - B \|^2$ and $f_2(U) = \| C^b U - C \|^2$. The inverses of
the optimal transformations T and U are then appropriately applied to the core
array $G^b$. Next, transformational freedom for the first mode (associated
with component matrix $A^b$ and the first mode of the core array) is exploited
by transforming the current bootstrap core array across the first mode such
that it optimally resembles the sample core array G in the least squares
sense. See Kiers (2004) for technical details.

For each loading and core value, the associated 500 values in the resulting 
transformed bootstrap solutions can be used to set up a 95% percentile interval.
These intervals will typically be smaller than both the ones based on fully
identified solutions, and the ones based on only permuting and scaling boot- 
strap solutions, because now similarity across all possible transformations is 
taken into account. 



4 Performance of bootstrap confidence intervals 

The above described bootstrap percentile intervals are considered estimates
of confidence intervals. This means that, if we have a 95% percentile interval,
we would like to conclude from this that with 95% certainty it covers the true
population parameter. If we work with fully identified solutions, then
it is clear what the actual population parameters refer to. When
transformational freedom is used, the coverage property of our intervals
should be that, in 95% of all possible samples from our population, after
optimal transformation of the population component matrices and core towards
their sample counterparts, the population parameters fall in the confidence
intervals we set up. Obviously, for transformation we should read
“permutation and scaling”, or “orthonormal rotation”, if this is the kind of
transformation actually used.

By means of a simulation study, Kiers (2004) assessed the coverage properties
of the above described bootstrap procedures, both for CP and Tucker3.
Specifically, first, large population data sets were constructed according to
the three-way model at hand, to which varying amounts of noise were added.
The numbers of variables were 4 or 8, and the numbers of conditions were 6 or
20; the numbers of components used varied between 2 and 4. The appropriate
three-way solution was computed for the population. Next, samples (of sizes
20, 50, and 100) were drawn from this population, the three-way method at
hand was applied to each sample, and this was followed by a bootstrap
procedure set up in one of the ways described above. Finally, for each
parameter it was assessed whether the population parameter, after optimal
transformation of the population solution towards the sample solution, was
covered by the 95% bootstrap confidence interval estimated for it. Across all
samples and populations, the percentage of coverages should be 95%, and it was
verified whether the actual coverages came near to this percentage.
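The coverage check itself amounts to counting how often a population value falls inside its interval. A minimal sketch, assuming intervals are given as (lower, upper) pairs aligned with the population values (the function name is hypothetical):

```python
def coverage(intervals, true_values):
    """Share of parameters whose population value lies inside its interval."""
    hits = sum(lo <= t <= hi for (lo, hi), t in zip(intervals, true_values))
    return hits / len(true_values)
```

In a simulation study this share would be accumulated over all parameters, samples, and populations, and compared with the nominal 95%.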

It was found in the simulation studies that the overall average coverages per 
method and per type of parameter (variable loading, situation loading, or 
core), ranged from 92% to 95%, except for Tucker3 using the principal axes
solution and making bootstrap solutions comparable to it by only using
permutations and reflections (here the worst average coverage was found for
the elements of the B matrices: 85%). Such overall average coverages are
optimistic, because they may have resulted from over- and undercoverages 
cancelling each other. Therefore, coverage percentages for individual condi- 
tions (which are less reliable, because they pertain to smaller numbers of 
cases) were also inspected. It turned out that these range from roughly 84% 
to 98%, disregarding the troublesome case mentioned above. The lowest cov- 
erage percentages were found for the smallest sample size (20). It can be 
concluded that, when interpreting the bootstrap percentile intervals as con- 
fidence intervals, it should be taken into account that they tend to be too 
small, especially in case of sample sizes as small as 20. 

For a single full bootstrap analysis, 500 three-way component analyses have 
to be carried out, so it is important to know whether this can be done in rea- 
sonable time. Kiers (2004) reported that, for the largest sizes in his study, the 
Tucker3 bootstrap analyses cost about 30 seconds, which seems acceptable. 
For CP, however, even for sample sizes of 50, computation time was about 5 
minutes. Fortunately, a procedure using the sample solution as start for the 
bootstrap analyses, can help to decrease this computation time considerably, 
while not affecting the coverage performance. 

5 An application: Bootstrap confidence intervals for 
results from a Tucker3 Analysis 

Kiers and van Mechelen (2001) reported the Tucker3 analysis of the scores 
of 140 subjects on 14 five-point scales measuring the degree of experiencing 
various anxiety related phenomena in 11 different stressful situations. The 
data have been collected by Maes, Vandereycken, and Sutren at the Univer- 
sity of Leuven, Belgium. Here we have reanalyzed their data, using the very 
same options as they used, and now computed bootstrap confidence intervals 
for the outcomes. Here we used the procedure where the bootstrap compo- 
nent loadings for the anxiety scales and for the situations, and the core array 
are matched by means of optimal orthogonal rotations to the corresponding 
sample component matrices and core.

The bootstrap confidence intervals for the loadings of the anxiety scales are
given in Table 1, and those for the situations in Table 2. In Table 3, the core
values are reported, and, to keep the results insightful, only for the values
that play a role in the interpretation, confidence intervals are reported.



Table 1. Confidence intervals for loadings of anxiety scales on components (the
latter interpreted as by Kiers and van Mechelen, 2001).

Anxiety scale       Approach-avoidance    Autonomic physiology  Sickness        Excretory need
Heart beats faster  [-0.12, -0.00]        [ 0.44,  0.64]        [-0.19,  0.07]  [-0.26, -0.08]
“Uneasy feeling”    [-0.34, -0.19]        [ 0.15,  0.38]        [-0.05,  0.23]  [-0.18,  0.02]
Emotions disrupt    [-0.25, -0.09]        [ 0.11,  0.34]        [ 0.10,  0.41]  [-0.15,  0.11]
Feel exhilarated    [ 0.41,  0.52]        [ 0.04,  0.20]        [-0.03,  0.17]  [-0.00,  0.15]
Not want to avoid   [ 0.29,  0.45]        [-0.22, -0.03]        [-0.12,  0.14]  [-0.09,  0.10]
Perspire            [-0.13, -0.02]        [ 0.41,  0.58]        [-0.11,  0.11]  [-0.11,  0.05]
Need to urinate     [-0.04,  0.15]        [ 0.02,  0.35]        [-0.24,  0.23]  [ 0.34,  0.61]
Enjoy challenge     [ 0.43,  0.53]        [ 0.02,  0.17]        [-0.01,  0.18]  [-0.05,  0.08]
Mouth gets dry      [-0.06,  0.16]        [ 0.12,  0.51]        [-0.24,  0.25]  [ 0.18,  0.49]
Feel paralyzed      [-0.14,  0.02]        [ 0.01,  0.33]        [ 0.03,  0.44]  [ 0.07,  0.36]
Full stomach        [-0.09,  0.05]        [-0.10,  0.13]        [ 0.54,  0.85]  [-0.20,  0.05]
Seek experiences    [ 0.42,  0.54]        [ 0.03,  0.23]        [-0.05,  0.22]  [-0.12,  0.06]
Need to defecate    [-0.15,  0.03]        [-0.27,  0.10]        [-0.30,  0.25]  [ 0.48,  0.81]
Feel nausea         [-0.19, -0.08]        [-0.24, -0.03]        [ 0.28,  0.54]  [ 0.12,  0.35]



Note: intervals for high loadings used in the original interpretation are set in 
bold. 



The confidence intervals for the anxiety scale loadings vary somewhat in
width, but are usually rather small for the highest loadings (which are set
in bold face). These values are the ones on which Kiers and van Mechelen
(2001) based their interpretation of the components, hence it is comforting to
see that these intervals are usually not too wide. The main exceptions are the
intervals for the loadings of “Mouth gets dry” on “Autonomic physiology”
and “Excretory need”, which are both wide, indicating that it is not at all
clear to which component this anxiety scale is most strongly related (which is
remarkable, because this scale refers clearly and solely to an autonomic
physiological reaction). Without confidence intervals, this lack of clarity
would have gone unnoticed.

The confidence intervals for the situation loadings vary somewhat more in
width. Again the intervals for the highest loadings are set in bold face. They
are small for “Performance judged by others”, but for “Inanimate danger”
especially the “Sail boat on rough sea” situation has a wide interval, and
both highest loadings on the “Alone in woods at night” component have wide
intervals as well. Clearly, the judgement component is better determined than
the other two.



Table 2. Confidence intervals for component values of situations on components
(the latter interpreted as by Kiers and van Mechelen, 2001).

Situation                    Performance judged  Inanimate         Alone in woods
                             by others           danger            at night
Auto trip                    [ 0.02,  0.22]      [ 0.01,  0.26]    [-0.29,  0.20]
New date                     [ 0.14,  0.31]      [ 0.03,  0.24]    [-0.40, -0.03]
Psychological experiment     [-0.19,  0.23]      [-0.12,  0.42]    [-0.30,  0.72]
Ledge high on mountain side  [-0.05,  0.19]      [ 0.56,  0.87]    [-0.18,  0.30]
Speech before large group    [ 0.36,  0.55]      [-0.25,  0.02]    [-0.24,  0.12]
Consult counseling bureau    [ 0.10,  0.45]      [-0.27,  0.12]    [-0.25,  0.53]
Sail boat on rough sea       [ 0.02,  0.29]      [ 0.27,  0.70]    [-0.34,  0.26]
Match in front of audience   [ 0.22,  0.46]      [-0.02,  0.26]    [-0.25,  0.24]
Alone in woods at night      [ 0.06,  0.33]      [-0.09,  0.17]    [ 0.25,  0.93]
Job-interview                [ 0.36,  0.51]      [-0.18,  0.01]    [-0.13,  0.21]
Final exam                   [ 0.38,  0.53]      [-0.29, -0.03]    [ 0.02,  0.33]



Note: intervals for high loadings used in the original interpretation are set in 
bold. 

The core values are used to interpret the components for the individu- 
als indirectly, through the interpretation of the components for the anxiety 
scales and the situations, see Kiers and van Mechelen (2001). The confidence 
intervals for the highest core values are usually relatively small. This is even 
the case for the values just higher than 10 (in absolute sense). The confidence 
intervals suggest even these smaller values can be taken rather seriously. 

6 Discussion 

In the present paper, procedures have been described for determining boot- 
strap percentile intervals for all parameters resulting from a Tucker3 or CP 
three-way analysis. These can be used as such, that is, as intervals indi- 
cating the stability of solutions across resampling from the same data, and 
hence give an important primary indication of their reliability. However, it 
was found that the 95% percentile intervals also turn out to be fairly good
approximations to 95% confidence intervals in most cases. Thus, they can at
least tentatively be used as confidence intervals as well. Some improvement 
of these intervals, however, still remains to be desired. 
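
The percentile-interval idea itself can be illustrated with a minimal sketch. It is not the Tucker3 or CP procedure described in this paper: the function name percentile_interval, the toy data, and the use of the mean as the statistic are assumptions made only for the example. In the three-way setting the resampled units would be the individuals, and the statistic would be a loading or core element obtained after the chosen transformation of each bootstrap solution.

```python
import random
import statistics


def percentile_interval(data, statistic, n_boot=1000, level=0.95, seed=0):
    """Bootstrap percentile interval for statistic(data) (illustrative helper)."""
    rng = random.Random(seed)
    n = len(data)
    boot_values = []
    for _ in range(n_boot):
        # Draw n observations with replacement from the observed sample.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        boot_values.append(statistic(sample))
    boot_values.sort()
    alpha = 1.0 - level
    # Positions of the lower and upper percentiles in the sorted bootstrap values.
    lo_idx = int((alpha / 2) * (n_boot - 1))
    hi_idx = int((1 - alpha / 2) * (n_boot - 1))
    return boot_values[lo_idx], boot_values[hi_idx]


# Toy data (hypothetical scores, not taken from the anxiety data set).
scores = [2.1, 3.4, 1.9, 4.0, 2.8, 3.1, 2.5, 3.7, 2.2, 3.0]
lower, upper = percentile_interval(scores, statistics.mean)
print(f"sample mean = {statistics.mean(scores):.2f}, "
      f"95% percentile interval = [{lower:.2f}, {upper:.2f}]")
```

The width of the resulting interval plays the same role as the interval widths reported in the tables: a narrow interval indicates a value that is stable across resampling.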

Table 3. Core array with confidence intervals, in brackets, only for the high core
values. Labels A1, ..., A6 refer to the 6 components for summarizing the subjects.

Performance judged by others
      Appr.Avoid.       Auto.phys.          Sickness           Excr.need
A1    36.5 [28,45]      -1.0                -0.5               -0.2
A2     0.8               1.6                -0.3                2.2
A3    -0.2               0.7                -0.1               36.9 [29,43]
A4    -0.9              40.0 [31,48]         1.2                1.2
A5     0.5              -0.1                 1.2                0.9
A6    -0.3               1.0                34.9 [28,43]        0.2

Inanimate danger
      Appr.Avoid.       Auto.phys.          Sickness           Excr.need
A1     1.6               3.4                 2.0                1.0
A2    30.3 [25,34]     -11.0 [-15,-7]      -11.8 [-15,-8]      -9.0
A3     2.8               3.5                 2.4               15.2 [9,20]
A4     2.7              11.2 [5,16]          0.6               -0.6
A5     0.4              -2.6                 1.9                1.9
A6    -0.4              -4.0                 6.5               -4.7

Alone in woods at night
      Appr.Avoid.       Auto.phys.          Sickness           Excr.need
A1     2.5               4.3                 1.7               -2.2
A2     0.4              -0.4                 0.8                2.4
A3     1.6              -0.5                 3.9               12.4 [7,16]
A4     1.2               5.0                -4.8               -7.0
A5    26.4 [19,30]     -18.4 [-22,-11]      -8.3               -6.6
A6     3.0               1.7                 9.8                2.2

Note: core values higher than 10 (in absolute sense) are set in bold.


Different procedures have been proposed, depending on which transfor- 
mations one allows for the bootstrap solutions. The choice between these 
should be made on theoretical grounds, not on empirical grounds. That is, 
this depends on whether or not the ordering of components in terms of col- 
umn sums of squares, and the optimal simplicity of solutions in terms of 
varimax is taken seriously or not. If such characteristics are not taken seri- 
ously, indeed one should use all the rotational freedom that is available in 
setting up bootstrap intervals. 

The approximate confidence intervals given here pertain to each individual 
output parameter. However, obviously, the output parameters are not in- 
dependent from each other. For instance, already the unit column sums of 
squares constraints on the component matrices ensure that elements within 
columns of such matrices depend on each other. Moreover, the optimality of
a solution does not depend on each parameter individually, but on the complete
configuration of all output parameters. Thus, one may expect that, if
a percentile interval for a particular element of B, say, does not contain the 
population parameter value, then it is rather likely that percentile intervals 
for other elements of B will not cover their population counterparts either. 
Such dependence even holds for elements from different matrices: if
an 'extreme' solution for B is found (such that the associated percentile
intervals miss most of the population parameters), then this will most likely
also affect the solution for the core G (and hence lead to misfitting percentile
intervals for many elements of G). Clearly, further research is needed to deal
with the dependence of output parameters. For now, it suffices to remark 
that the confidence intervals are each taken as if they ’were on their own’, 
and in interpreting the confidence intervals their dependence should not be 
overlooked, in particular when they are to be used to make probability state- 
ments on sets of parameters jointly. 

The bootstrap method is sometimes called a computer intensive method. 
When we apply it to three-way analysis, indeed, this intensity becomes ap- 
parent, especially when using CP. Computation times for moderately sized 
problems are nonnegligible, although not prohibitive. Some speed improve- 
ment was obtained, and further speed improvement may be possible. All in 
all, however, it can be concluded that the bootstrap now is a viable procedure 
for estimating confidence intervals for the results from exploratory three-way 
methods.

Abstract. Software development has become a distributed, collaborative process
based on the assembly of off-the-shelf and purpose-built components. The selection 
of software components from component repositories and the development of com- 
ponents for these repositories requires an accessible information infrastructure that 
allows the description and comparison of these components. 

General knowledge relating to software development is as important in
this context as knowledge concerning the application domain of the software. Together
they form two pillars on which the structural and behavioural properties of software
components can be expressed. Form, effect, and intention are the essential aspects 
of process-based knowledge representation with behaviour as a primary property. 

We investigate how this information space for software components can be or- 
ganised in order to facilitate the required taxonomy, thesaurus, conceptual model, 
and logical framework functions. The focal point is an axiomatised ontology that, in
addition to the usual static view on knowledge, also intrinsically addresses the dy- 
namics, i.e. the behaviour of software. Modal logics are central here - providing a 
bridge between classical (static) knowledge representation approaches and behav- 
iour and process description and classification. 

We relate our discussion to the Web context, looking at Web services as com- 
ponents and the Semantic Web as the knowledge representation framework. 



1 Introduction 

The style of software development has changed dramatically over the past 
decades. Software development has become a distributed, collaborative pro- 
cess based on the assembly of off-the-shelf and purpose-built software com- 
ponents - an evolutionary process that in the last years has been strongly 
influenced by the Web as a software development and deployment platform. 

This change in the development style has an impact on information and 
knowledge infrastructures surrounding these software components. The se- 
lection of components from component repositories and the development of 
components for these repositories requires an accessible information infra- 
structure that allows component description, classification, and comparison. 
Organising the space of knowledge that captures the description of properties 
and the classification of software components based on these descriptions is 
central. Discovery and composition of software components based on these
descriptions and classifications have become central activities in the software
development process (Crnkovic and Larsson (2002)). In a distributed
environment where providers and users of software components meet in elec- 
tronic marketplaces, knowledge about these components and their proper- 
ties is essential; a shared knowledge representation language is a prerequisite 
(Horrocks et al. (2003)). Describing software behaviour, i.e. the effect of the 
execution of services that a component might offer, is required. 

We will introduce an ontological framework for the description and classi- 
fication of software components that supports the discovery and composition 
of these components and their services - based on a formal, logical coverage 
of this topic in (Pahl (2003)). Terminology and logic are the cornerstones of 
our framework. Our objective here is twofold:

• We will illustrate an ontology based on description logics (a logic under- 
lying various ontology languages), i.e. a logic-based terminological clas- 
sification framework based on (Pahl (2003)). We exploit a connection to 
modal logics to address behavioural aspects, in particular the safety and 
liveness of software systems. 

• Since the World-Wide Web has the potential of becoming central in future 
software development approaches, we investigate whether the Web can 
provide a suitable environment for software development and what the 
requirements for knowledge-related aspects are. In particular Semantic 
Web technologies are important for this context. 

We approach the topic here from a general knowledge representation and 
organisation view, rather than from a more formal, logical perspective. 

In Section 2 we describe the software development process in distributed 
environments in more detail. In Section 3, we relate knowledge representation 
to the software development context. We define an ontological framework for 
software component description, supporting discovery and composition, in 
Section 4. We end with some conclusions in Section 5. 

2 The software development process 

The World-Wide Web is currently undergoing a change from a document- to 
a services-oriented environment. The vision behind the Web Services Frame- 
work is to provide an infrastructure of languages, protocols, and tools to 
enable the development of services-oriented software architectures on and 
for the Web (W3C (2004)). Service examples range from simple informa- 
tion providers, such as weather or stock market information, to data storage 
support and complex components supporting e-commerce or online banking 
systems. An example for the latter is an account management component of- 
fering balance and transfer services. Service providers advertise their services; 
users (potential clients of the provider) can browse repository-based marketplaces
to find suitable services, see Fig. 1. The prerequisite is a common
language to express properties of these Web-based services and a classification
approach to organise these. The more knowledge is available about these
services, the better a potential client can determine the suitability of an offer.

Services and components are related concepts. Web services can be pro- 
vided by software components; we will talk about service components in 
this case. If services exhibit component character, i.e. are self-contained with 
clearly defined interfaces that allow them to be reused and composed, then
their composition into larger software system architectures is possible. Pluggable
and reusable software components are one of the approaches to software
development that promise risk minimisation and cost reduction. Composition
can be physical, i.e. a more complex artefact is created through assembly,
or logical, i.e. a complex system is created by allowing physically distributed
components to interact. Even though our main focus is on components in general,
we will discuss them here in the context of the Web Services platform.

The ontological description of component properties is our central concern 
(Fig. 1). We will look at how these descriptions are used in the software 
development process. Two activities are most important: 

• Discovery of provided components (lower half of Fig. 1) in structured 
repositories. Finding suitable, reusable components for a given develop- 
ment based on abstract descriptions is the problem. 

• Composition of discovered components in complex service-based compo- 
nent architectures through interaction (upper half of Fig. 1). Techniques 
are needed to compose the components in a consistent way based on their 
descriptions. 

For a software developer, the Web architecture means that most software de- 
velopment and deployment activities will take place outside the boundaries of 
her/his own organisation. Component descriptions can be found in external
repositories. These components might even reside as provided services outside
the developer's own organisation. Shared knowledge and knowledge formats
consequently become essential.



3 A knowledge space for software development 

The Web as a software platform is characterised by different actors, different 
locations, different organisations, and different systems participating in the 
development and deployment of software. As a consequence of this hetero- 
geneous architecture and the development paradigm as represented in Fig. 
1, shared and structured knowledge about components plays a central role. 
A common understanding and agreement between the different actors in the 
development process are necessary. 

A shared, organised knowledge space for software components in
service-oriented architectures is needed. The question of how to organise this
knowledge space is the central question of this paper. In order to organise the
knowledge space through an ontological framework (which we understand essentially
as basic notions, a language, and reasoning techniques for sharable knowledge
representation), we address three facets of the knowledge space: firstly, the types
of knowledge concerned, secondly, the functions of the knowledge space,
and, finally, the representation of knowledge (Sowa (2000)).

Three types of knowledge can be represented in three layers: 

• The application domain as the basic layer. 

• Static and dynamic component properties as the central layer. 

• Meta-level activity-related knowledge about discovery and composition. 

We distinguish four knowledge space functions (Daconta et al. (2003)) that 
characterise how knowledge is used to support the development activities: 

• Taxonomy - terminology and classification; supporting structuring and 
search. 

• Thesaurus - terms and their relationships; supporting a shared, controlled 
vocabulary. 

• Conceptual model - a formal model of concepts and their relationships; 
here of the application domain and the software technology context. 

• Logical theory - logic-supported inference and proof; here applied to be- 
havioural properties. 

The third facet deals with how knowledge is represented. In general, knowl- 
edge representation (Sowa (2000)) is concerned with the description of entities 
in order to define and classify these. Entities can be distinguished into ob- 
jects (static entities) and processes (dynamic entities). Processes are often 
described in three aspects or tiers: 

• Form - algorithms and implementation - the ‘how’ of process description

• Effect - abstract behaviour and results - the ‘what’ of process description

• Intention - goal and purpose - the ‘why’ of process description 

We have related the aspects form, effect, and intention to software character- 
istics such as algorithms and abstract behaviour. The service components are 
software entities that have process character, i.e. we will use this three-tiered 
approach for their description. 

The three facets of the knowledge space outline its structure. They serve 
as requirements for concrete description and classification techniques, which 
we will investigate in the remainder. 



4 Organising the knowledge space 

4.1 Ontologies 

Ontologies are means of knowledge representation, defining so-called shared 
conceptualisations. Ontology languages provide a notation for terminological 
definitions that can be used to organise and classify concepts in a domain. 
Combined with a symbolic logic, we obtain a framework for specification, 
classification, and reasoning in an application domain. Terminological logics 
such as description logics (Baader et al. (2003)) are an example of the latter. 

The Semantic Web is an initiative for the Web that builds on ontology
technology (Berners-Lee et al. (2001)). XML - the eXtensible Markup Language -
is the syntactical format. RDF - the Resource Description Framework
- is a triple-based formalism (subject, property, object) to describe entities. 
OWL - the Web Ontology Language - provides additional logic-based rea- 
soning based on RDF. 

We use Semantic Web-based ontology concepts to formalise and axioma- 
tise processes, i.e. to make statements about processes and to reason about 
them. Description logic, which is used to define OWL, is based on concept 
and role descriptions (Baader et al. (2003)). Concepts represent classes of 
objects; roles represent relationships between concepts; and individuals are 
named objects. Concept descriptions are based on primitive logical combina- 
tors (negation, conjunction) and hybrid combinators (universal and existen- 
tial quantification). Expressions of a description logic are interpreted through 
sets (concepts) and relations (roles). 

We use a connection between description logic and dynamic logic (Sattler 
et al. (2003), Chapter 4.2.2). A dynamic logic is a modal logic for the de- 
scription of programs and processes based on operators to express necessity 
and possibility (Kozen and Tiuryn (1990)). This connection allows us to ad- 
dress safety (necessity of behaviour) and liveness (possibility of behaviour) 
aspects of service component behaviour by mapping the two modal opera- 
tors ‘box’ (or ‘always’, for safety) and ‘diamond’ (or ‘eventually’, for liveness) 
to the description logic universal and existential quantification, respectively. 
The central idea behind this connection is that roles can be interpreted as
accessibility relations between states, which are central concepts of process-
oriented software systems. The correspondence between description logics
and a multi-modal dynamic logic is investigated in detail in (Schild (1991)).

Fig. 2. A Service Component Ontology.
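
The correspondence between quantification over roles and the modal operators can be made concrete with a small sketch over a finite set of states. The state names, the Transfer relation, and the concept Reduced below are hypothetical; the two helper functions simply evaluate the standard set semantics of existential and universal quantification over a role, read as 'diamond' and 'box' respectively.

```python
def exists_role(role, concept, domain):
    """Existential quantification over a role: states with at least one
    role-successor inside the concept (the modal 'diamond')."""
    return {x for x in domain
            if any((x, y) in role and y in concept for y in domain)}


def forall_role(role, concept, domain):
    """Universal quantification over a role: states whose role-successors
    all lie inside the concept (the modal 'box'); vacuously true for
    states without successors."""
    return {x for x in domain
            if all(y in concept for y in domain if (x, y) in role)}


# Hypothetical system states and a transitional role (accessibility relation).
states = {"s0", "s1", "s2", "s3"}
transfer = {("s0", "s1"), ("s0", "s2"), ("s1", "s3")}
reduced = {"s1", "s3"}  # states in which the balance counts as reduced

possibly_reduced = exists_role(transfer, reduced, states)      # liveness reading
necessarily_reduced = forall_role(transfer, reduced, states)   # safety reading
print(sorted(possibly_reduced))
print(sorted(necessarily_reduced))
```

State s0 satisfies the existential but not the universal description, because one of its two Transfer successors lies outside Reduced.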



4.2 A discovery and composition ontology 

An intuitive approach to represent software behaviour in an ontological form 
would most likely be to consider components or services as the central con- 
cepts (DAML-S Coalition (2002)). We, however, propose a different approach. 
Our objective is to represent software systems. These systems are based on
inherent notions of state and state transition. Both notions are central in our 
approach. Fig. 2 illustrates the central ideas. Service executions lead from old 
(pre)states to new (post)states, i.e. the service is represented as a role (a rec- 
tangle in the diagram), indicated through arrows. The modal specifications 
characterise in which state executions might (using the possibility operator to 
express liveness properties) or should (using the necessity operator to express 
safety properties) end. For instance, we could specify that a customer may 
(possibly) check his/her account balance, or, that a transfer of money must 
(necessarily) result in a reduction of the source account balance. Transitional 
roles such as Service in Fig. 2 are complemented by more static, descriptional 
roles such as preCond or inSign, which are associated through non-directed 
connections. For instance, preCond associates a precondition to a prestate; 
inSign associates the type signatures of possible service parameters. Some 
properties, such as the service name servName, will remain invariant with
respect to state change. 

Central to our approach is the intrinsic specification of process behav- 
iour in the ontology language itself. Behaviour specifications based on the 
descriptions of necessity and possibility are directly accessible to logic-based 
methods. This makes reasoning about behaviour of components possible. 

We propose a two-layered ontology for discovery and composition. The 
upper ontology layer supports discovery, i.e. addresses description, search,
discovery, and selection. The lower ontology layer supports composition, i.e.
addresses the assembly of components and the choreography of their interac- 
tions. We assume that execution-related aspects are an issue of the provider 
- shareable knowledge is therefore not required. 

Table 1 summarises development activities and knowledge space aspects. 
It relates the activities discovery, composition, and execution on services 
(with the corresponding ontologies) to the three knowledge space facets. 



Table 1. Development Activities and Knowledge Space Facets.

Activity           Knowledge Aspect    Knowledge Type          Function
Discovery          intention           domain                  taxonomy,
(upper ontology)   (terminology)                               thesaurus
Composition        effect              component,              conceptual model,
(lower ontology)   (behaviour)         component activities    logical theory
Execution          form                component               conceptual model
                   (implementation)



4.3 Description of components 

Knowledge describing software components is represented in three layers. We 
use two ontological layers here to support the abstract properties. 

• The intention is expressed through assumptions and goals of services in 
the context of the application domain. 

• The effect is a contract-based specification of system invariants, pre- and 
postconditions describing the obligations of users and providers. 

• The form defines the implementation of service components, usually in a 
non-ontological, hidden format. 

We focus on effect descriptions here. Effect descriptions are based on modal 
operators. These allow us to describe process behaviour and composition 
based on the choreography of component interactions. The notion of compo- 
sition shall be clarified now. Composition in Web- and other service-oriented 
environments is achieved in a logical form. Components are provided in form 
of services that will reside in their provider location. Larger systems are 
created by allowing components to interact through remote operation in- 
vocation. Components are considered as independent concurrent processes 
that can interact (communicate) with each other. Central in the composition 
are the abstract effect of individual services and the interaction patterns of 
components as a whole. 

We introduce role expressions based on the role constructors sequential 
composition R ; S, iteration !R, and choice R + S into a basic ontology language
to describe interaction processes (Pahl (2003)). We often use R ∘ S
instead of R ; S if R and S are functional roles, i.e. are interpreted by functions -
this notation will become clearer when we introduce names and service
parameters. Using this language, we can express ordering constraints for
parameterised service components. These process expressions constrain the 
possible interaction of a service component with a client. 

For instance, Login; !(Balance + Transfer) is a role expression describing
an interaction process of an online banking user starting with a login, then 
repeatedly executing balance enquiry or money transfer. 
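
Since roles are interpreted as relations between states, the three role constructors can be sketched as operations on sets of state pairs. The sketch below is only an illustration under stated assumptions: iteration is read as the reflexive-transitive closure (as in dynamic logic), and the state names and the concrete Login, Balance, and Transfer relations are hypothetical.

```python
def seq(r, s):
    """Sequential composition R ; S of two relations."""
    return {(x, z) for (x, y1) in r for (y2, z) in s if y1 == y2}


def choice(r, s):
    """Choice R + S: either relation may be taken."""
    return r | s


def iterate(r, domain):
    """Iteration !R, assumed here to be the reflexive-transitive closure of R."""
    closure = {(x, x) for x in domain}
    while True:
        extended = closure | seq(closure, r)
        if extended == closure:
            return closure
        closure = extended


# Hypothetical states: logged out, and logged in with a balance of 100, 70 or 40.
states = {"out", "in100", "in70", "in40"}
login = {("out", "in100")}
balance = {(s, s) for s in states if s != "out"}      # enquiry leaves the state unchanged
transfer = {("in100", "in70"), ("in70", "in40")}      # each transfer moves 30

# Login ; !(Balance + Transfer)
process = seq(login, iterate(choice(balance, transfer), states))
print(sorted(process))
```

The resulting relation links the logged-out state to every state reachable by a login followed by any number of balance enquiries or transfers.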

An effect specification¹ focussing on safety is, for a given system state,

$\forall\,\mathit{preCond}.\mathit{positive}(\mathit{Balance}(\mathit{no}))$ and

$\forall\,\mathit{Transfer}.\forall\,\mathit{postCond}.\mathit{reduced}(\mathit{Balance}(\mathit{no}))$

saying that if the account balance for account no is positive, then money
can be transferred, resulting (necessarily) in a reduced balance. Transfer is a
service; positive(Balance(no)) and reduced(Balance(no)) are pre- and post- 
condition, respectively. These conditions are concept expressions. The specifi- 
cation above is formed by navigating along the links created by roles between 
the concepts in Fig. 2 - Transfer replaces Service in the diagram. 

In Fig. 3, we have illustrated two sample component descriptions - one 
representing the requirements of a (potential) client, the other representing a 
provided bank account component. Each component lists a number of individ- 
ual services (operations) such as Login or Balance. We have used pseudocode 
for signatures (parameter names and types) and pre-/postconditions - a for- 
mulation in proper description logic will be discussed later on. We have 
limited the specification in terms of pre- and postconditions to one service, 
Transfer. 

The requirements specification forms a query as a request, see Fig. 1. The 
ontology language is the query language. The composition ontology provides 
the vocabulary for the query. A query should result ideally in the identifica- 
tion of a suitable (i.e. matching) description of a provided component. In our 
example, the names correspond - this, however, is in general not a matching 
prerequisite. Behaviour is the only definitive criterion. 

4.4 Discovery and composition of components 

Component-based development is concerned with discovery and composition. 
In the Web context, both activities are supported by Semantic Web and Web 
Services techniques. They support semantic descriptions of components,
marketplaces for the discovery of components based on intention descriptions 
as the search criteria, and composition support based on semantic effect de- 
scriptions. The deployment of components is based on the form description. 

1 This safety specification serves to illustrate effect specification. We will improve 
this currently insufficient specification (negative account balances are possible, 
but might not be desired) in the next section when we introduce names and 
parameters.

Component AccountRequirements
  signatures and pre-/postconditions
    Login
      inSign    no : int, user : string
      outSign   void
    Balance
      inSign    no : int
      outSign   real
    Transfer
      inSign    no : int, dest : int, sum : real
      outSign   void
      preCond   Balance(no) > sum
      postCond  Balance(no) = Balance(no)@pre - sum
    Logout
      inSign    no : int
      outSign   void
  interaction process
    Login; !Balance; Logout

Component BankAccount
  signatures and pre-/postconditions
    Login(no : int, user : string)
      inSign    no : int, user : string
      outSign   void
    Balance(no : int) : real
      inSign    no : int
      outSign   real
    Transfer(no : int, dest : int, sum : real)
      inSign    no : int, dest : int, sum : real
      outSign   void
      preCond   true
      postCond  Balance(no) = Balance(no)@pre - sum
    Logout(no : int)
      inSign    no : int
      outSign   void
  interaction process
    Login; !(Balance + Transfer); Logout

Fig. 3. Bank Account Component Service.



Query and Discovery. The aim of the discovery support is to find suitable 
provided components in a first step that match based on the application 
domain related goals and that, in a second step, match based on the more 
technical effect descriptions. Essentially, the ontology language provides a 
query language. The client specifies the requirements in a repository query 
in terms of the ontology, which have to be matched by a description of a 
provided component.
Matching requires technical support, in particular for the formal effect
descriptions. Matching can be based on techniques widely used in software 
development, such as refinement (which is for instance formalised as the con- 
sequence notion in dynamic logic). We will focus on the description of effects, 
i.e. the lower ontology layer (cf. Fig. 2): 

• Service component-based software systems are based on a central state 
concept; additional concepts for auxiliary aspects such as the pre- and
poststate-related descriptions are available.

• Service components are behaviourally characterised by transitional roles
(for state changes between prestate and poststate) and descriptional roles
(auxiliary state descriptions).



Matching and composition. In order to support matching and com- 
position of components through ontology technology, we need to extend 
the (already process-oriented) ontology language we presented above (Pahl 
and Casey (2003)). We can make statements about component interaction 
processes, but we cannot refer to the data elements processed by services. 
The role expression sublanguage needs to be extended by names (represent- 
ing data elements) and parameters (which are names passed on to services 
for processing): 

• Names: a name is a role n[Name] defined by the identity relation on the
interpretation of an individual n.

• Parameters: a parameterised role is a transitional role R applied to a
name n[Name], i.e. R ∘ n[Name].

We can make our Transfer service description more precise by using a data 
variable (sum) in pre- and postconditions and as a parameter: 

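$\forall\,\mathit{preCond}.(\mathit{Balance}(\mathit{no}) > \mathit{sum})$ and

$\forall\,\mathit{Transfer} \circ \mathit{sum}[\mathit{Name}].\forall\,\mathit{postCond}.(\mathit{Balance}(\mathit{no}) = \mathit{Balance}(\mathit{no})@\mathit{pre} - \mathit{sum})$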

This specification requires Transfer to decrease the pre-execution balance by 
sum. 

Matching needs to be supported by a comparison construct. We already 
mentioned a refinement notion as a suitable solution. This definition, however, 
needs to be based on the support available in description logics. Subsumption 
is the central inference technique. Subsumption is the subclass relationship 
on concept and role interpretations. We define two types of matching:

• For individual services, we define a refinement notion based on weaker
preconditions (allowing a service to be invoked in more states) and stronger
postconditions (improving the results of a service execution). For example,
true as the precondition and Balance(no) = Balance(no)@pre - sum as
the postcondition for Transfer ∘ sum[Name] matches, i.e. refines, the
requirements specification with Balance(no) >= sum as the precondition
and Balance(no) = Balance(no)@pre - sum as the postcondition, since
it allows the balance to become negative (i.e. allows more flexibility for
an account holder).

• For service processes, we define a simulation notion based on sequential
process behaviour. A process matches another process if it can simulate
the other's behaviour. For example, the expression Login; !(Balance +
Transfer); Logout matches, i.e. simulates, Login; !Balance; Logout, since
the transfer service can be omitted.
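
The refinement notion for individual services can be sketched as a check over a finite sample of states. This is only an illustration, not the subsumption-based inference of the description logic: the Contract type, the refines function, and the sampled balances and amounts are assumptions made for the example, and the check is simplified in that postconditions are compared on all sampled states.

```python
from typing import Callable, NamedTuple


class Contract(NamedTuple):
    pre: Callable[[int, int], bool]        # pre(balance_before, amount)
    post: Callable[[int, int, int], bool]  # post(balance_before, amount, balance_after)


def refines(provided, required, balances, amounts):
    """True if, on the sampled states, the provided contract has a weaker (or
    equal) precondition and a stronger (or equal) postcondition."""
    for before in balances:
        for amount in amounts:
            # Weaker precondition: wherever the requirement may call, so may we.
            if required.pre(before, amount) and not provided.pre(before, amount):
                return False
            for after in balances:
                # Stronger postcondition: every provided outcome is acceptable.
                if (provided.post(before, amount, after)
                        and not required.post(before, amount, after)):
                    return False
    return True


# Transfer as required by the client and as provided by the bank account component.
required = Contract(pre=lambda b, a: b >= a,
                    post=lambda b, a, after: after == b - a)
provided = Contract(pre=lambda b, a: True,
                    post=lambda b, a, after: after == b - a)

balances = range(-100, 201, 50)
amounts = (50, 100)
print(refines(provided, required, balances, amounts))  # True
print(refines(required, provided, balances, amounts))  # False
```

The second call fails because the requirements specification cannot be invoked in states where the balance is smaller than the transferred sum, whereas the provided service can.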

Both forms of matching are sufficient criteria for subsumption. Matching of 
effect descriptions is the prerequisite for the composition of services. Matching 
guarantees the proper interaction between composed service components. 
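
The refinement notion for individual services can be illustrated with a small brute-force check. This is only a sketch under stated assumptions: it enumerates a small finite set of states instead of using description logic subsumption, and the names `Spec`, `refines`, `STATES`, and `STEPS` are hypothetical helpers that do not come from the text.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, Tuple

Pre = Tuple[int, int]        # state before the call: (balance, amount)
Step = Tuple[int, int, int]  # transition: (balance before, amount, balance after)


@dataclass(frozen=True)
class Spec:
    """Hypothetical service description: a precondition and a postcondition."""
    pre: Callable[[Pre], bool]
    post: Callable[[Step], bool]


def refines(provided: Spec, required: Spec,
            states: Iterable[Pre], steps: Iterable[Step]) -> bool:
    """Check weaker precondition and stronger postcondition by enumeration."""
    # Weaker precondition: every state accepted by the requirement is accepted.
    weaker_pre = all(provided.pre(s) for s in states if required.pre(s))
    # Stronger postcondition: every provided result satisfies the requirement.
    stronger_post = all(required.post(t) for t in steps if provided.post(t))
    return weaker_pre and stronger_post


# Assumed finite universe for the Transfer example.
balances = range(-5, 6)
amounts = range(0, 6)
STATES = list(product(balances, amounts))
STEPS = list(product(balances, amounts, balances))


def transfer_post(t: Step) -> bool:
    # Balance(no) = Balance(no)@pre - sum
    return t[2] == t[0] - t[1]


required = Spec(pre=lambda s: s[0] >= s[1], post=transfer_post)
provided = Spec(pre=lambda s: True, post=transfer_post)

print(refines(provided, required, STATES, STEPS))
print(refines(required, provided, STATES, STEPS))
```

With these definitions the provided service (precondition true) refines the requirement, while the converse check fails because the requirement rejects states in which the balance is smaller than the sum.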

5 Conclusions 

Knowledge plays an important role in the context of component- and service- 
oriented software development. The emergence of the Web as a development 
and deployment platform for software emphasises this aspect. 

We have structured a knowledge space for software components in service- 
oriented architectures. Processes and their behavioural properties were the 
primary aspects. We have developed a process-oriented ontological model 
based on the facets form, effect, and intention. The discovery and the com- 
position of process-oriented service components are the central activities. This 
knowledge space is based on an ontological framework formulated in a de- 
scription logic. The defined knowledge space supports a number of different 
functions - taxonomy, thesaurus, conceptual model, and logical theory. These 
functions support a software development and deployment style suitable for 
the Web and Internet environment. 

Explicit, machine-processable knowledge is the key to future automation 
of software development activities. In particular, Web ontologies have the 
potential to become an accepted format that supports such an automation 
endeavour.

Abstract. In this paper we propose the Time Interval Multimedia Event (TIME)
framework as a robust approach for recognition of multimedia patterns, e.g.
highlight events, in soccer video. The representation used in TIME extends the
Allen temporal interval relations and allows for proper inclusion of context and
synchronization of the heterogeneous information sources involved in multimedia
pattern recognition. For automatic classification of highlights in soccer video,
we compare three different machine learning techniques, namely the C4.5 decision
tree, Maximum Entropy, and the Support Vector Machine. It was found that by
using the TIME framework the amount of video a user has to watch in order to see
almost all highlights can be reduced considerably, especially in combination
with a Support Vector Machine.



1 Introduction 

The vast amount of sport video that is broadcast on a daily basis is too much
to handle, even for sports enthusiasts. To manage the video content, annotation
is required. However, manual annotation of video material is cumbersome
and tedious. This fact was acknowledged by the multimedia research community
more than a decade ago, and has resulted in numerous methodologies for
automatic analysis and indexing of video documents, see
Snoek and Worring (2005).

However, automatic indexing methods suffer from the semantic gap, or
the lack of coincidence between the extracted information and its
interpretation by a user, as recognized for image indexing in Smeulders et al.
(2000). Video indexing has the advantage that it can profit from combined
analysis of visual, auditory, and textual information sources. For this
multimodal indexing, two problems have to be solved. Firstly, when integrating
analysis results of different information channels, difficulties arise with
respect to synchronization. The synchronization problem is typically solved by
converting all modalities to a common layout scheme, e.g. camera shots, thereby
ignoring the layout of the other modalities. This introduces the second problem,
namely the difficulty of properly modelling context, i.e. how to include clues
that do not occur at the exact moment of the highlight event of interest. When
synchronization and context have been solved, multimodal video indexing
might be able to bridge the semantic gap to some extent.

* This research is sponsored by the ICES/KIS MIA project and TNO-TPD.

Existing methods for multimedia pattern recognition, or multimodal video
indexing, can be grouped into knowledge based approaches (Babaguchi et al.
(2002), Fischer et al. (1995)) and statistical approaches (Assfalg et al. (2002),
Han et al. (2002), Lin and Hauptmann (2002), Naphade and Huang (2001)).
The former approaches typically combine the output of different multimodal
detectors into a rule based classifier. To limit model dependency and improve
robustness, a statistical approach seems more promising. Various statistical
frameworks can be exploited for multimodal video indexing. Recently
there has been a wide interest in applying the Dynamic Bayesian Network
(DBN) framework for multimedia pattern recognition (Assfalg et al. (2002),
Naphade and Huang (2002)). Other statistical frameworks that were proposed
include Maximum Entropy (Han et al. (2002)) and Support Vector Machines
(Lin and Hauptmann (2002)). However, all of these frameworks suffer from
the problems of synchronization and context identified above. Furthermore,
they lack satisfactory inclusion of the textual modality.

To tackle the problems of proper synchronization and inclusion of contextual
clues for multimedia pattern recognition, we propose the Time Interval
Multimedia Event (TIME) framework. Moreover, as it is based on statistics,
TIME yields a robust approach for multimedia pattern recognition. To
demonstrate the viability of our approach we provide a systematic evaluation
of three statistical classifiers, using TIME, on the domain of soccer and
discuss their performance. The soccer domain was chosen because contextual
clues like replays and distinguishing camera movement don’t appear at the
exact moment of the highlight event. Hence, their synchronization should be
taken into account. We improve upon existing work related to soccer video
indexing, e.g. Assfalg et al. (2002) and Ekin et al. (2003), by exploiting
multimodal, instead of unimodal, information sources, and by using a classifier
based on statistics instead of heuristics that is capable of handling both
synchronization and context.

The rest of this paper is organized as follows. First we introduce the TIME 
framework, discussing both representation and classification. Then we discuss 
the multimodal detectors used for classification of various highlight events in 
soccer video in section 3. Experimental results are presented in section 4. 



2 Multimedia event classification framework 

We view a video document from the perspective of its author (Snoek and
Worring (2005)). Based on a predefined semantic intention, an author combines
certain multimedia layout and content elements to express his message.
For analysis purposes this authoring process should be reversed. Hence, we
start with reconstruction of layout and content elements. To that end, discrete
detectors, indicating the presence or absence of specific layout and content
elements, are often the most convenient means to describe the layout
and content. This has the added advantage that detectors can be developed
independently of one another. To combine the resulting detector segmentations
into a common framework, some means of synchronization is required.
To illustrate, consider Fig. 1. In this example a soccer video document is
represented by various time dependent detector segmentations, defined on
different asynchronous layout and content elements. At a certain moment a
goal occurs. Clues for the occurrence of this event are found in the detector
segmentations that have a value within a specific position of the time window
of the event, e.g. excited speech of the commentator, but also in contextual
detector segmentations that have a value before the event, e.g. a camera
panning towards the goal area, or after the actual occurrence of the event, e.g.
the occurrence of the keyword score in the time stamped closed caption. Clearly,
in terms of the theoretical framework, it doesn’t matter exactly what the
detector segmentations are. What is important is that we need a means to
express the different visual, auditory, and textual detector segmentations in
one fixed representation without loss of their original layout scheme.

[Figure 1: timeline of camera shots, microphone shots, and text shots with the
detector segmentations panning camera, speech, excited speech, high motion,
close-up face, and goal related keyword around a goal event.]

Fig. 1. Detector based segmentation of a multimodal soccer video document into
its layout and content elements with a goal event (box) and contextual relations
(dashed arrows).

Hence, for automatic classification of a semantic event, ω, we need to capture
a video document in a common pattern representation. In this section we
first consider how to represent such a pattern, x, using multimodal detector
segmentations and their relations. We then proceed with statistical pattern
recognition techniques of varying complexity that exploit this representation
for classification.



2.1 Pattern representation 

Applying layout and content detectors to a video document results in various
segmentations. We define:

Definition 1 (TIME Segmentation) Decomposition of a video document
into one or more series of time intervals, τ, based on a set of multimodal
detectors.



100 Snoek and Worring 



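[Figure 2: conditions for the exact Allen relations and the fuzzy TIME relations
between two intervals a and b, where subscript 1 denotes the start point and
subscript 2 the end point of an interval.]

Relation     Allen                           TIME
NoRelation   (not defined)                   t^a_2 < t^b_1 - T_2
Precedes     t^a_2 < t^b_1                   t^a_2 > t^b_1 - T_2
                                             t^a_2 < t^b_1 - T_1
Meets        t^a_2 = t^b_1                   t^b_1 - T_1 < t^a_2 < t^b_1 + T_1
Overlaps     t^a_1 < t^b_1                   t^a_1 < t^b_1 - T_1
             t^b_1 < t^a_2 < t^b_2           t^b_1 + T_1 < t^a_2 < t^b_2 - T_1
Starts       t^a_1 = t^b_1                   t^b_1 - T_1 < t^a_1 < t^b_1 + T_1
             t^a_2 < t^b_2                   t^a_2 < t^b_2 - T_1
During       t^a_1 > t^b_1                   t^a_1 > t^b_1 + T_1
             t^a_2 < t^b_2                   t^a_2 < t^b_2 - T_1
Finishes     t^a_1 > t^b_1                   t^a_1 > t^b_1 + T_1
             t^a_2 = t^b_2                   t^b_2 - T_1 < t^a_2 < t^b_2 + T_1
Equals       t^a_1 = t^b_1                   t^b_1 - T_1 < t^a_1 < t^b_1 + T_1
             t^a_2 = t^b_2                   t^b_2 - T_1 < t^a_2 < t^b_2 + T_1

Fig. 2. Overview of the differences between exact Allen relations and TIME
relations, extended from Aiello et al. (2002).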



To model synchronization and context, we need a means to express relations
between these time intervals. Allen showed that thirteen relationships are
sufficient to model the relationship between any two intervals. To be specific,
the relations are: precedes, meets, overlaps, starts, during, finishes, equals, and
their inverses, identified by adding _i to the relation name (Allen (1983)). For
practical application of the Allen time intervals two problems occur. First,
in video analysis exact alignment of start or end points seldom occurs due
to noise. Second, two time intervals will always have a relation even if they
are far apart in time. To solve the first problem a fuzzy interpretation was
proposed by Aiello et al. (2002). The authors introduce a margin, T_1, to
account for imprecise boundary segmentations, explaining the fuzzy nature.
The second problem only occurs for the relations precedes and precedes_i,
as for these the two time intervals are disjoint. Thus, we introduce a range
parameter, T_2, which assigns to two intervals the type NoRelation if they are
too far apart in time. Hence, we define:

Definition 2 (TIME Relations) The set of fourteen fuzzy relations that
can hold between any two elements from two segmentations, τ_1 and τ_2, based
on the margin T_1 and the range parameter T_2.

Obviously the new relations still assure that between two intervals one and 
only one type of relation exists. The difference between standard Allen rela- 
tions and TIME relations is visualized in Fig. 2. 
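
To make the role of the margin and the range parameter concrete, the following sketch assigns exactly one of the fourteen relations to a pair of intervals. It is an illustration under assumptions, not the authors' implementation: the function name `time_relation`, the inclusive handling of boundary cases, and the order in which the tests are applied are all choices made here, and intervals are assumed to be longer than twice the margin.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def _cmp(x: float, y: float, margin: float) -> int:
    """Fuzzy three-way comparison: -1 before, 0 within the margin, +1 after."""
    if x < y - margin:
        return -1
    if x > y + margin:
        return 1
    return 0


# Relation implied by the fuzzy comparison of (start points, end points).
_BY_ENDPOINTS = {
    (-1, -1): "overlaps", (1, 1): "overlaps_i",
    (0, -1): "starts", (0, 1): "starts_i",
    (1, -1): "during", (-1, 1): "during_i",
    (1, 0): "finishes", (-1, 0): "finishes_i",
    (0, 0): "equals",
}


def time_relation(a: Interval, b: Interval,
                  margin: float, max_range: float) -> str:
    """Return the relation of interval a with respect to interval b."""
    (a1, a2), (b1, b2) = a, b
    if _cmp(a2, b1, margin) < 0:   # a ends clearly before b starts
        return "precedes" if b1 - a2 <= max_range else "no_relation"
    if _cmp(b2, a1, margin) < 0:   # b ends clearly before a starts
        return "precedes_i" if a1 - b2 <= max_range else "no_relation"
    if _cmp(a2, b1, margin) == 0:  # end of a touches start of b
        return "meets"
    if _cmp(b2, a1, margin) == 0:  # end of b touches start of a
        return "meets_i"
    return _BY_ENDPOINTS[(_cmp(a1, b1, margin), _cmp(a2, b2, margin))]


shot = (10.0, 30.0)  # hypothetical reference camera shot
print(time_relation((10.2, 14.0), shot, 0.5, 6.0))
print(time_relation((33.0, 34.0), shot, 0.5, 6.0))
print(time_relation((50.0, 51.0), shot, 0.5, 6.0))
```

Because every endpoint comparison is a three-way decision, each pair of intervals falls into exactly one branch, which mirrors the statement that one and only one relation exists between two intervals.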

Since TIME relations depend on two intervals, we choose one interval as a
reference interval and compare this interval with all other intervals.
Continuing the example, when we choose a camera shot as a reference interval, the
goal can be modelled by a swift camera pan that starts the current camera
shot, excited speech that overlaps_i the camera shot, and a keyword in the
closed caption that precedes_i the camera shot within a range of 6 seconds.
The latter can be explained by the time lag between the actual occurrence
of the event and its mention in the closed caption. By using TIME
segmentations and TIME relations it now becomes possible to represent events,
context, and synchronization in one common framework:

Definition 3 (TIME Representation) Model of a multimedia pattern x
based on the reference interval τ_ref, and represented as a set of n TIME
relations, with d TIME segmentations.

In theory, the number of TIME relations, n, is bounded by the number of
TIME segmentations, d. Since every TIME segmentation can be expressed
as a maximum of fourteen TIME relations with the fixed reference interval,
the maximum number of TIME relations in any TIME representation is equal
to 14(d − 1). In practice, however, a subset can be chosen, either by feature
selection techniques (Jain et al. (2000)), experiments, or domain knowledge.

With the TIME representation we are able to combine layout and content 
elements into a common framework. Moreover, it allows for proper modelling 
of synchronization and inclusion of context as they can both be expressed as 
time intervals. 
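
As a rough illustration of how a TIME representation could be turned into a pattern x, the sketch below builds a binary vector with one entry per selected (segmentation, relation) pair. Everything here is an assumption made for illustration: the function `time_representation`, the toy `crisp_relation` stand-in (which ignores the margin and range parameters), and the example intervals are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Interval = Tuple[float, float]
RelationFn = Callable[[Interval, Interval], str]


def time_representation(
    reference: Interval,
    segmentations: Dict[str, Sequence[Interval]],
    selected: Sequence[Tuple[str, str]],
    relation_fn: RelationFn,
) -> List[int]:
    """Binary pattern x: one entry per selected (segmentation, relation) pair."""
    observed = {
        (name, relation_fn(interval, reference))
        for name, intervals in segmentations.items()
        for interval in intervals
    }
    return [1 if pair in observed else 0 for pair in selected]


def crisp_relation(a: Interval, b: Interval) -> str:
    """Toy stand-in for the TIME relations (no margin, no range)."""
    if a[0] >= b[0] and a[1] <= b[1]:
        return "during"
    if a[0] >= b[1]:
        return "precedes_i"
    return "other"


shot = (10.0, 30.0)  # reference interval
segmentations = {
    "excitement": [(12.0, 18.0)],
    "goal_keyword": [(33.0, 34.0)],
    "close_up": [],
}
selected = [
    ("excitement", "during"),
    ("goal_keyword", "precedes_i"),
    ("close_up", "precedes_i"),
]
x = time_representation(shot, segmentations, selected, crisp_relation)
print(x)  # [1, 1, 0]
```

The length of `selected` plays the role of n; choosing it by hand here corresponds to selecting a subset by domain knowledge.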

2.2 Pattern classification 

To learn the relation between a semantic event ω and the corresponding pattern
x, we exploit the powerful properties of statistical classifiers. In standard
pattern recognition, a pattern is represented by features. In the TIME framework
a pattern is represented by related detector segmentations. In the literature a
wide range of statistical classifiers has been proposed, see Jain et al. (2000).
We will discuss three classifiers with varying complexity. We start with the
C4.5 decision tree (Quinlan (1993)), then we proceed with the Maximum Entropy
framework (Jaynes (1957), Berger et al. (1996)), and finally we discuss
classification using a Support Vector Machine (Vapnik (2000)).



C4.5 Decision tree The C4.5 decision tree learns from a training set the
individual importance of each TIME relation by computing the gain ratio
(Quinlan (1993)). Based on this ratio a binary tree is constructed where
a leaf indicates a class, and a decision node chooses between two subtrees
based on the presence of some TIME relation. The more important a TIME
relation is for the classification task at hand, the closer it is located to the
root of the tree. Because the relation selection algorithm continues until the
entire training set is completely covered, some pruning is necessary to prevent
overtraining. Decision trees are considered suboptimal for most applications
(Jain et al. (2000)). However, they form a nice benchmark for comparison
with more complex classifiers and have the added advantage that they are
easy to interpret.
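
The gain ratio used to rank TIME relations can be sketched as follows. This is a minimal illustration of the standard definition (information gain divided by split information) for a single binary relation; the function names and the toy data are assumptions, and the sketch covers neither tree construction nor pruning.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature: Sequence[int], labels: Sequence[str]) -> float:
    """Information gain of the feature divided by its split information."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(feature)
    return gain / split_info if split_info > 0 else 0.0


# Toy training set: presence of two hypothetical TIME relations per shot.
labels = ["goal", "goal", "other", "other"]
keyword_precedes_i = [1, 1, 0, 0]   # perfectly separates the classes
high_motion_during = [1, 0, 1, 0]   # carries no information

print(gain_ratio(keyword_precedes_i, labels))
print(gain_ratio(high_motion_during, labels))
```

A relation with a high gain ratio, such as the first one in this toy example, would be placed close to the root of the tree.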



Maximum Entropy Whereas a decision tree exploits individual TIME relations
in a hierarchical manner, the Maximum Entropy (MaxEnt) framework
exploits the TIME relations simultaneously. In MaxEnt, first a model of the
training set is created, by computing the expected value, E_train, of each TIME
relation using the observed probabilities p(x, ω) of pattern and event pairs
(Berger et al. (1996)). To use this model for classification of unseen patterns,
we require that the constraints for the training set are in accordance with
the constraints of the test set. Hence, we also need the expected value of the
TIME relations in the test set, E_test. The complete model of training and
test set is visualized in Fig. 3. We are left with the problem of finding the
optimal reconstructed model, p*, that finds the most likely event ω given an
input pattern x, and that adheres to the imposed constraints. From all those
possible models, the maximum entropy philosophy dictates that we select the
one with the maximum entropy. It is shown by Berger et al. (1996) that there
is always a unique model p*(ω|x) with maximum entropy, and that p*(ω|x)
has a form equivalent to:

$$p^*(\omega \mid x) = \frac{1}{Z} \prod_{j=1}^{n} \alpha_j^{\tau_j(x,\omega)} \qquad (1)$$

where α_j is the weight for TIME relation τ_j and Z is a normalizing constant,
used to ensure that a probability distribution results. The values for α_j are
computed by the Generalized Iterative Scaling (GIS) algorithm (Darroch and
Ratcliff (1972)). Since GIS relies on both E_train and E_test for calculation of
α_j, an approximation proposed by Lau et al. (1993) is used that relies only on
E_train. This makes it possible to construct a classifier that depends completely
on the training set. The automatic weight computation is an interesting property
of the MaxEnt classifier, since it is very difficult to accurately weigh the
importance of individual detectors and TIME relations beforehand.
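
Evaluating a model of the product form in Eq. (1) is straightforward once the weights are known. The sketch below only evaluates such a model; it does not implement GIS. It assumes, for illustration, that τ_j(x, ω) equals 1 when relation j is present in x and the candidate event is ω, so that each event has its own list of weights. The function name and the weight values are hypothetical.

```python
import math
from typing import Dict, Sequence


def maxent_posterior(x: Sequence[int],
                     alphas: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """p*(event | x) = (1/Z) * prod_j alpha_j ** tau_j(x, event)."""
    scores = {
        event: math.prod(a ** present for a, present in zip(weights, x))
        for event, weights in alphas.items()
    }
    z = sum(scores.values())  # normalizing constant Z
    return {event: score / z for event, score in scores.items()}


# Hypothetical weights for three TIME relations and two events.
alphas = {
    "goal": [3.0, 2.0, 2.0],
    "other": [1.0, 1.0, 1.0],
}
x = [1, 0, 1]  # relations 1 and 3 are present in the pattern
posterior = maxent_posterior(x, alphas)
print(posterior)
```

In this toy case the unnormalized scores are 6 for goal and 1 for other, so Z = 7 and the goal event receives probability 6/7.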



Fig. 3. (a) Simplified visual representation of the maximum entropy framework.
(b) Visual representation of the Support Vector Machine framework in two
dimensions. The optimal hyperplane is indicated as a thick solid line.

Support Vector Machine The Support Vector Machine (SVM) classifier
follows another approach. Each pattern x is represented in an n-dimensional
space, spanned by the TIME relations. Within this relation space an optimal
hyperplane is searched that separates the relation space into two different
categories, ω, where the categories are represented by +1 and −1 respectively.
The hyperplane has the following form: ω(w · x + b) ≥ 1, where w is a weight
vector, and b is a threshold. A hyperplane is considered optimal when the
distance to the closest training examples is maximum for both categories.
This distance is called the margin. Consider the example in Fig. 3. Here a
two-dimensional relation space consisting of two categories is visualized. The
solid bold line is chosen as optimal hyperplane because of the largest possible
margin. The circled data points closest to the optimal hyperplane are called
the support vectors. The problem of finding the optimal hyperplane is a
quadratic programming problem of the following form (Vapnik (2000)):

$$\min_{w,\xi} \left\{ \frac{1}{2}\, w \cdot w + C \sum_{i=1}^{l} \xi_i \right\} \qquad (2)$$

under the following constraints:

$$\omega_i (w \cdot x_i + b) \geq 1 - \xi_i, \quad i = 1, 2, \ldots, l \qquad (3)$$

where C is a parameter that balances training error and model
complexity, l is the number of patterns in the training set, and ξ_i are slack
variables that are introduced when the data is not perfectly separable. These
slack variables are useful when analyzing multimedia, since results of
individual detectors typically include a number of false positives and negatives.
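
The interplay between the objective (2) and the constraints (3) can be made tangible by evaluating both for a fixed hyperplane. The sketch below does not solve the quadratic program; it only computes, for given w and b, the smallest slack values that satisfy the constraints and the resulting objective value. The non-negativity of the slack variables is a standard assumption that is not stated explicitly in the text, and the function names and toy data are hypothetical.

```python
from typing import List, Sequence


def slacks(w: Sequence[float], b: float,
           patterns: Sequence[Sequence[float]],
           labels: Sequence[int]) -> List[float]:
    """Smallest non-negative slack per pattern satisfying constraint (3)."""
    result = []
    for x, y in zip(patterns, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        result.append(max(0.0, 1.0 - y * score))
    return result


def objective(w: Sequence[float], b: float,
              patterns: Sequence[Sequence[float]],
              labels: Sequence[int], c: float) -> float:
    """Value of objective (2) for a fixed hyperplane (w, b)."""
    return 0.5 * sum(wi * wi for wi in w) + c * sum(slacks(w, b, patterns, labels))


# Toy two-dimensional relation space with labels +1 and -1.
patterns = [[2.0, 0.0], [0.5, 1.0], [-2.0, 0.0]]
labels = [1, 1, -1]
w, b = [1.0, 0.0], 0.0

print(slacks(w, b, patterns, labels))          # [0.0, 0.5, 0.0]
print(objective(w, b, patterns, labels, 10.0))  # 5.5
```

The second pattern lies inside the margin and therefore needs a slack of 0.5; a larger C makes such violations more expensive relative to the model complexity term.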



3 Highlight event classification in soccer broadcasts 

Important events in a soccer game are scarce and occur more or less randomly.
Examples of such events are goals, penalties, yellow cards, red cards, and
substitutions. We define those events as follows:

• Goal: the entire camera shot showing the actual goal;

• Penalty: beginning of the camera shot showing the foul until the end of
the camera shot showing the penalty;

• Yellow card: beginning of the camera shot showing the foul until the end
of the camera shot that shows the referee with the yellow card;

• Red card: beginning of the camera shot showing the foul until the end of
the camera shot that shows the referee with the red card;

• Substitution: beginning of the camera shot showing the player who goes
out, until the end of the camera shot showing the player who comes in.

Table 1. TIME representation for soccer analysis. T_2 indicates the contextual
range used by the precedes and precedes_i relations.

TIME segmentation              TIME relations   T_2 (s)
Camera work                    during
Person                         during
Close-up                       precedes_i       0 - 40
Goal keyword                   precedes_i       0 - 6
Card keyword                   precedes_i       0 - 6
Substitution keyword           precedes_i       0 - 6
Excitement                     All relations    0 - 1
Info block statistics          precedes_i       20 - 80
Person block statistics        precedes_i       20 - 50
Referee block statistics       precedes_i       20 - 50
Coach block statistics         precedes_i       20 - 50
Goal block statistics          precedes_i       20 - 50
Card block statistics          precedes_i       20 - 50
Substitution block statistics  during
Shot length                    during

Those events are important for the game, and therefore the author adds
contextual clues to make the viewer aware of the events. For accurate detection
of events, this context should be included in the analysis.

Some of the detectors used for the segmentation are soccer specific. Other
detectors were chosen based on reported robustness and training experiments.
The parameters for individual detectors were found by experimentation using
the training set. Combining all TIME segmentations with all TIME relations
results in an exhaustive use of relations. We therefore use a subset, tuned
on the training set, to prevent a combinatorial explosion. For all events, all
mentioned TIME segmentations and TIME relations are used, i.e. we used
the same TIME representation for all events from the same domain.

The teletext (European closed caption) provides a textual description of
what is said by the commentator during a match. This information source
was analyzed for the presence of informative keywords, like yellow, red, card,
goal, 1-0, 1-2, and so on. In total 30 informative stemmed keywords were
defined for the various events.

On the visual modality we applied several detectors. The type of camera work
(Baan et al. (2001)) was computed for each camera shot, together with the shot
length. A face detector by Rowley et al. (1998) was applied for detection of
persons. The same detector formed the basis for a close-up detector. Close-ups
are detected by relating the size of detected faces to the total frame size.
Often, an author shows a close-up of a player after an event of importance.
Among the most informative sources in a soccer broadcast are the visual overlay
blocks that give information about the game. We classified each detected overlay
block as either info, person, referee, coach, goal, card, or substitution block
(Snoek and Worring (2003)), and added some additional statistics, for example
the duration of visibility of the overlay block, as we observed that
substitution and info blocks are displayed longer on average. Note that all
detector results are transformed into binary output before they are included in
the analysis.

From the auditory modality the excitement of the commentator is a valuable
resource. For the proper functioning of an excitement detector, we require
that it is insensitive to crowd cheer. This can be achieved by using a high
threshold on the average energy of a fixed window, and by requiring that an
excited segment has a minimum duration of 4 seconds.
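
The two requirements for the excitement detector, a high energy threshold and a minimum duration, can be sketched as a simple run-length filter. This is an illustrative sketch only: the function name, the window length, and the threshold value are assumptions, and the input is taken to be a precomputed average energy per fixed window.

```python
from typing import List, Sequence, Tuple


def excited_segments(window_energy: Sequence[float], window_s: float,
                     threshold: float,
                     min_duration_s: float = 4.0) -> List[Tuple[float, float]]:
    """Return (start, end) times of runs above threshold that last long enough."""
    segments = []
    start = None
    # A sentinel below any threshold closes a run that reaches the end.
    for i, energy in enumerate(list(window_energy) + [float("-inf")]):
        if energy > threshold:
            if start is None:
                start = i
        elif start is not None:
            if (i - start) * window_s >= min_duration_s:
                segments.append((start * window_s, i * window_s))
            start = None
    return segments


# Hypothetical average energy per 1-second window.
energy = [0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.9, 0.9, 0.1]
print(excited_segments(energy, window_s=1.0, threshold=0.5))  # [(1.0, 5.0)]
```

The short burst near the end is discarded because it lasts only 2 seconds, which is the kind of brief loudness the minimum duration is meant to suppress.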

We take the result of automatic shot segmentation as a reference interval. 
An overview of the TIME representation for the soccer domain is summarized 
in Table 1. In the next section we will evaluate the automatic indexing of 
events in soccer video, based on the presented pattern representation. 



4 Evaluation 

For the evaluation of the TIME framework we recorded 8 live soccer broad- 
casts, about 12 hours in total. We used a representative training set of 3 
hours and a test set of 9 hours. In this section we will first present the evalu- 
ation criteria used for evaluating the TIME framework, then we present and 
discuss the classification results obtained. 



4.1 Evaluation criteria 

The standard measure for performance of a statistical classifier is the error
rate. However, this is unsuitable in our case, since the relevant events
are outnumbered by irrelevant pieces of footage. An alternative is to
use the precision and recall measures adapted from information retrieval. These
measures give an indication of correctly classified highlight events, falsely
classified highlight events, and missed highlight events. However, since
highlight events in a soccer match can cross camera shot boundaries, we merge
adjacent camera shots with similar labels. As a consequence, we lose our
arithmetic unit, and precision and recall can no longer be computed.
As an alternative for precision we relate the total duration of the segments
that are retrieved to the total duration of the relevant segments. Moreover,
since it is unacceptable from a user's perspective that scarce soccer events are
missed, we strive to find as many events as possible, at the cost of an increase
in false positives. Finally, because it is difficult to exactly define the start
and end of an event in soccer video, we introduce a tolerance value T_3 (in
seconds) with respect to the boundaries of detection results. We used a T_3
of 7 s for all soccer events. A merged segment is considered relevant if one
of its boundaries plus or minus T_3 crosses that of a labelled segment in the
ground truth.

Table 2. Evaluation results of the different classifiers for soccer events, where
duration is the total duration of all segments that are retrieved.

              Ground truth       C4.5                 MaxEnt               SVM
              Total  Duration    Relevant  Duration   Relevant  Duration   Relevant  Duration
Goal          12     3 m 07 s    2         2 m 40 s   10        10 m 14 s  11        11 m 52 s
Yellow Card   24     10 m 35 s   13        14 m 28 s  22        26 m 12 s  22        12 m 31 s
Substitution  29     8 m 09 s    25        15 m 27 s  25        7 m 36 s   25        7 m 23 s
Σ             65     21 m 51 s   40        32 m 35 s  57        44 m 02 s  58        31 m 46 s

4.2 Classification results 

For evaluation of TIME on the soccer domain, we manually labelled all the
camera shots as belonging to one of four categories: yellow card, goal,
substitution, or unknown. Red card and penalty were excluded from analysis
since there was only one instance of each in the data set. For all three
remaining events a C4.5, a MaxEnt, and an SVM classifier were trained. Results
on the test set are summarized in Table 2.

When analyzing the results, we clearly see that the C4.5 classifier performs
worst. Although it does a good job on detection of substitutions, it is
significantly worse for both yellow cards and goals when compared to the more
complex MaxEnt and SVM classifiers. When we compare results of MaxEnt
and SVM, we observe that almost all events are found independent of the
classifier used. The amount of video data that a user has to watch before
finding those events is about two times longer when a MaxEnt classifier is
used, and about one and a half times longer when an SVM is used, compared
to the best case scenario. This is a considerable reduction of watching time
when compared to the total duration, 9 hours, of all video documents in the
test set. With the SVM we were able to detect one extra goal, compared to
MaxEnt. Analysis of the retrieved segments showed that the results of Maximum
Entropy and SVM are almost identical, except for goal events: nine goals
were retrieved by both classifiers, while the remaining classified goals were
different for each classifier.
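
The watching time factors quoted above follow directly from the total durations in Table 2. The short calculation below reproduces them; the helper `seconds` and the variable names are introduced here for illustration only.

```python
def seconds(text: str) -> int:
    """Parse a duration such as '21 m 51 s' into seconds."""
    minutes, _, secs, _ = text.split()
    return 60 * int(minutes) + int(secs)


# Total durations taken from the bottom row of Table 2.
ground_truth = seconds("21 m 51 s")
retrieved = {
    "C4.5": seconds("32 m 35 s"),
    "MaxEnt": seconds("44 m 02 s"),
    "SVM": seconds("31 m 46 s"),
}
test_set = 9 * 3600  # total duration of the test set in seconds

ratios = {name: duration / ground_truth for name, duration in retrieved.items()}
for name, ratio in ratios.items():
    share = 100 * retrieved[name] / test_set
    print(f"{name}: {ratio:.2f} x ground truth, {share:.1f}% of the test set")
```

The factors come out at roughly 2.0 for MaxEnt and 1.45 for the SVM. C4.5 has a similar factor to the SVM but, according to Table 2, retrieves only 40 of the 65 events.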

When we take a closer look at the individual results of the different
classifiers, it is striking that C4.5 can achieve a good result on some events,
e.g. substitution, while performing badly on others, e.g. goal. This can,
however, be explained by the fact that the events where C4.5 scores well can be
detected based on a limited set of TIME relations. For substitution events in
soccer an overlay during the event is a very strong indicator. When an event is
composed of several complex TIME relations, like goal, the relatively simple
C4.5 classifier performs worse than both the more complex MaxEnt and SVM
classifiers.

To gain insight into the meaning of complex relations in the soccer domain,
we consider the GIS algorithm from section 2.2, which allows us to compute the
importance or relative weight of the different relations used. The weights
computed by GIS indicate that for the soccer events goal and yellow card
specific keywords in the closed captions, excitement with during and overlaps
relations, a close-up afterwards, and the presence of an overlay nearby are
important relations.

Fig. 4. The Goalgle soccer video search engine.

Overall, the SVM classifier achieves comparable or better results than
MaxEnt. When we analyze false positives for both classifiers, we observe
that they arise because some of the important relations are shared
between different events. This mostly occurs when another event is indeed
happening in the video, e.g. a hard foul or a scoring chance. False negatives
are mainly caused by the failure of a detector. By increasing the number
of detectors and relations in our model we might be able to reduce those false
positives and false negatives.

5 Conclusion 

To bridge the semantic gap for multimedia event classification, a new frame- 
work is required that allows for proper modelling of context and synchroniza- 
tion of the heterogeneous information sources involved. We have presented 
the Time Interval Multimedia Event (TIME) framework that accommodates 
those issues, by means of a time interval based pattern representation. More- 
over, the framework facilitates robust classification using various statistical 
classifiers. 

To demonstrate the effectiveness of TIME it was evaluated on the domain 
of soccer. We have compared three different statistical classifiers, with varying 
complexity, and show that there exists a clear relation between narrowness 
of the semantic gap and the needed complexity of a classifier. When there 
exists a simple mapping between a limited set of relations and the semantic 
concept we are looking for, a simple decision tree will give results comparable
to those of a more complex SVM. When the semantic gap is wider, detection
will profit from combined use of multimodal detector relations and a more 
complex classifier, like the SVM. Results show that a considerable reduction 
of watching time can be achieved. The indexed events were used to build the 
Goalgle soccer video search engine, see Fig. 4.

Abstract. The concept of attributable risk (AR), introduced more than 50 years
ago, quantifies the proportion of cases diseased due to a certain exposure (risk)
factor. While valid approaches to the estimation of crude or adjusted AR exist,
a problem remains concerning the attribution of AR to each of a set of several
exposure factors. Inspired by mathematical game theory, namely, the axioms of
fairness and the Shapley value, introduced by Shapley in 1953, the concept of partial
AR has been developed. The partial AR offers a unique solution for allocating shares
of AR to a number of exposure factors of interest, as illustrated by data from the
German Göttingen Risk, Incidence, and Prevalence Study (G.R.I.P.S.).



1 Introduction 

Analytical epidemiological studies aim at providing quantitative information 
on the association between a certain exposure, or several exposures, and 
some disease outcome of interest. Usually, the disease etiology under study 
is multifactorial, so that several exposure factors have to be considered si- 
multaneously. The effect of a particular exposure factor on the dichotomous 
disease variable is quantified by some measure of association, including the 
relative risk ( RR ) or the odds ratio (OR), which will be explained in the next 
section. 

While these measures indicate by which factor the disease risk increases if a 
certain exposure factor is present in an individual, the concept of attribut- 
able risk (AR) addresses the impact of an exposure on the overall disease load 
in the population. This paper focusses on the AR, which can be informally 
introduced as the answer to the question, “what proportion of the observed 
cases of disease in the study population suffers from the disease due to the 
exposure of interest?”. In providing this information the AR places the concept 
of RR commonly used in epidemiology in a public health perspective, namely 
by also providing an answer to the reciprocal question, “what proportion of 
cases of disease could, theoretically, be prevented if the exposure factor 
could be entirely removed by some adequate preventive action?”. 

Since its introduction in 1953 (Levin (1953)), the concept of AR has 
increasingly been used by epidemiological researchers. However, while the 
methodology of this invaluable epidemiological measure has constantly been 
extended to cover a variety of epidemiological situations, its practical use has 
not followed these advances satisfactorily (reviewed by Uter and Pfahlberg (1999)). 

One of the difficulties in applying the concept of AR is the question of how to 
adequately estimate the AR associated with several exposure factors of inter- 
est, and not just one single exposure factor. The present paper briefly intro- 
duces the concept of sequential attributable risk (SAR) and then focusses on 
the partial attributable risk (PAR), following an axiomatic approach founded 
on game theory. For illustrative purposes, data from a German cohort study 
on risk factors for myocardial infarction are used. 



2 Basic definitions of attributable risk 



Suppose a population can be divided into an exposed subpopulation (E = 1) 
and an unexposed one (E = 0), as well as a diseased part (D = 1) and a 
non-diseased one (D = 0). Denote by P(A) the probability that a randomly chosen 
subject from this population belongs to subpopulation A, and by P(A|B) the 
corresponding conditional probability of A given B. 

The definition of the relative risk (RR) is then as follows: 



$$RR = \frac{P(D = 1 \mid E = 1)}{P(D = 1 \mid E = 0)}. \tag{1}$$



Another, well-established measure of individual risk is the odds ratio (OR), 
which compares the odds of being diseased instead of the risk of being diseased 
between the exposed (E = 1) and the unexposed (E = 0) subpopulation: 



$$OR = \frac{P(D = 1 \mid E = 1) / P(D = 0 \mid E = 1)}{P(D = 1 \mid E = 0) / P(D = 0 \mid E = 0)}. \tag{2}$$

The definition of attributable risk, in contrast, is as follows (for more formal 
details see Eide and Heuch (2001)) : 



$$AR = \frac{P(D = 1) - P(D = 1 \mid E = 0)}{P(D = 1)}. \tag{3}$$



Alternatively, the AR can be expressed in algebraically equivalent forms, as 
originally introduced by Levin (1953), 

$$AR = \frac{P(E = 1)\,[RR - 1]}{P(E = 1)\,[RR - 1] + 1}, \tag{4}$$

or, as defined by Miettinen (1974), 

$$AR = P(E = 1 \mid D = 1)\,\frac{RR - 1}{RR}. \tag{5}$$


As can be seen from these definitions, the AR depends both on the individual 
risk (RR) and on the exposure prevalence (P(E = 1)): the larger the RR, 
the larger the AR will be, given a fixed P(E = 1), and the higher the exposure 
prevalence, the larger the AR will be, given a certain RR. Moreover, a 
certain AR may result from different scenarios: a rare exposure associated 
with a high individual (relative) risk, or a common, but weak risk factor. 
Knowledge of the underlying scenario, and not only of the AR alone, is im- 
portant for public health decisions regarding intervention strategies: in the 
first case, a targeted approach aiming at the small, identifiable subgroup at 
high risk would be appropriate, while in the latter case, a “population strat- 
egy” offering intervention for nearly the whole population would be more 
advisable. 
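
As a small numerical check of the definitions in this section, the following sketch computes RR, OR and the three algebraically equivalent forms of the AR from a 2 x 2 table. The function name `risk_measures` and the counts are hypothetical and serve only as an illustration; they are not taken from any study.

```python
def risk_measures(a, b, c, d):
    """Measures of association from a 2 x 2 table of counts.

    a: exposed and diseased      b: exposed, not diseased
    c: unexposed and diseased    d: unexposed, not diseased
    """
    n = a + b + c + d
    p_d = (a + c) / n        # P(D = 1)
    p_e = (a + b) / n        # P(E = 1)
    p_d_e1 = a / (a + b)     # P(D = 1 | E = 1)
    p_d_e0 = c / (c + d)     # P(D = 1 | E = 0)
    p_e_d = a / (a + c)      # P(E = 1 | D = 1)

    rr = p_d_e1 / p_d_e0
    odds_ratio = (a * d) / (b * c)
    return {
        "RR": rr,
        "OR": odds_ratio,
        # AR as a proportion of P(D = 1)
        "AR": (p_d - p_d_e0) / p_d,
        # Levin's form, using exposure prevalence and RR
        "AR_Levin": p_e * (rr - 1) / (p_e * (rr - 1) + 1),
        # Miettinen's form, using exposure prevalence among cases and RR
        "AR_Miettinen": p_e_d * (rr - 1) / rr,
    }


# Hypothetical counts, chosen only for illustration.
measures = risk_measures(a=30, b=70, c=10, d=90)
```

With these counts the three AR expressions agree with each other, which is what the algebraic equivalence predicts.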



3 Crude and adjusted attributable risk 

The maximum likelihood estimator of AR can be easily obtained in 2 x 2 tables 
from cohort and cross-sectional studies by substituting sample proportions 
for the respective probabilities in (3), leading to what have been termed crude 
estimators of the AR (Walter (1976)). Some additional approximation (i.e., 
replacing RR by OR in (5)) is necessary for the case-control design 
(Whittemore (1982), Benichou (1991)). 

However, often we face a multifactorial etiology of disease, some of these 
factors being potential confounders for the impact of one certain factor of 
interest. In this situation, crude estimates of the AR derived from a D x E 
contingency table will be biased. If only one exposure factor is of interest in 
terms of AR estimation, confounding of this estimate can be overcome by 
calculating an adjusted AR: 



$$AR_{adj} = \frac{P(D = 1) - \sum_{C} P(D = 1 \mid E = 0, C)\, P(C)}{P(D = 1)}, \tag{6}$$



where C denotes the stratum variable formed by the combination of all other 
L exposures considered as nuisance factors. This AR adjusted for the to- 
tal effect of all L nuisance factors may be interpreted as the proportion of 
the diseased population that is potentially preventable if the risks of disease 
in the exposed sub-populations were changed to the risks of the unexposed 
(E = 0) population in all C strata of the adjusting variables. Estimation of 
the adjusted AR based on stratification methods (Gefeller (1992)) or on a 
logistic regression approach (Bruzzi et al. (1985)) was investigated in 
detail some time ago. 
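
The stratification idea can be sketched in a few lines. This is a minimal sketch, assuming the usual stratified form of the adjusted AR in which the stratum-specific risks among the unexposed are weighted by the stratum proportions P(C); the function name `adjusted_ar` and the counts are hypothetical.

```python
def adjusted_ar(strata):
    """Adjusted AR from one 2 x 2 table per confounder stratum.

    strata: list of (a, b, c, d) counts per stratum, where
            a = exposed cases,   b = exposed non-cases,
            c = unexposed cases, d = unexposed non-cases.
    """
    n = sum(a + b + c + d for a, b, c, d in strata)
    p_d = sum(a + c for a, b, c, d in strata) / n        # P(D = 1)
    # Expected disease probability if everyone had the risk of the
    # unexposed within their own stratum: sum_C P(D = 1 | E = 0, C) P(C)
    p_d_unexposed = sum(
        (c / (c + d)) * ((a + b + c + d) / n) for a, b, c, d in strata
    )
    return (p_d - p_d_unexposed) / p_d


# Two hypothetical confounder strata.
example = adjusted_ar([(20, 30, 10, 40), (10, 40, 5, 45)])
```

With a single stratum the function reduces to the crude AR, which is a convenient sanity check.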



Such an approach, however, is only reasonable whenever the specific aim 
of the study is to evaluate the role of only one particular exposure factor. 
Otherwise, the implicit hierarchy imposed on the variables involved in the 
calculation is not justified and another approach to AR estimation is required 
(Gefeller and Eide (1993)). 



4 Sequential attributable risk 

As a first step to overcome the limited usefulness of adjusted AR estimates 
when dealing with the problem of quantifying several ARs of several exposure 
factors of interest, the sequential AR (SAR) has been suggested. The 
idea behind the SAR is to consider sequences of exposure variables of inter- 
est and quantify the additional effect of one exposure on disease risk after 
the preceding variables have already been taken into account. For didactic 
purposes, the approach is outlined below in its basic form ignoring for the 
moment any hierarchy or grouping of exposure variables as well as additional 
nuisance variables used exclusively for the purpose of controlling confounding 
(Eide and Gefeller (1995)). 

Suppose a total of $K + 1$ exposure classes are generated by $L$ exposure 
factors, each with $K_i + 1$ exposure categories, i.e., 
$K + 1 = \prod_{i=1}^{L} (K_i + 1)$. Our 
interest lies in the potential reduction of cases when preventing the $L$ 
exposures, one at a time, in a given sequence, for instance starting with exposure 
no. 1, then exposure no. 2, and so on, until all $L$ exposures are eliminated 
in the population. A reasonable way to accomplish this is first to 
calculate the adjusted AR as shown in the previous section with all exposure 
factors but the first one included among the adjusting variables. This 
results in an adjusted AR denoted by $AR_{adj}^{(1)}$, derived from a situation with 
$\prod_{i=2}^{L} (K_i + 1)$ strata and $K_1 + 1$ exposure classes. Thereafter, define 
$AR_{adj}^{(2)}$ as the adjusted attributable risk calculated for the combined effect 
of the first and second exposure variables (creating $(K_1 + 1)(K_2 + 1)$ exposure 
classes), with the remaining exposures, including the adjustment variables 
(confounders), forming the $\prod_{i=3}^{L} (K_i + 1)$ strata. This stepwise procedure of 
calculating the adjusted AR for different sets of exposure variables can be 
continued until all $L$ exposures are incorporated among the variables generating 
the exposure classes. The last term of this sequence, $AR_{adj}^{(L)}$, corresponds to 
the total population impact of all $L$ exposures. 

Any difference $AR_{adj}^{(r)} - AR_{adj}^{(p)}$, $p < r$, $p, r \in \mathbb{N}$, describes the additional 
effect of considering the $(p+1)$st, $(p+2)$nd, ..., $r$-th exposure after having 
previously taken into account the effect of the first p exposures in the spec- 
ified sequence. These differences may be called sequential attributable risks 
(SAR). Notice that the SAR of a specific exposure factor may differ even for 
the same set of L exposures according to the sequence of exposure variables 
considered during the stepwise process of calculation. Hence, the SAR de- 
pends on the ordering within the sequence and is not unique for an exposure 
(for an illustrative example of this property see Gefeller et al. (1998)). Thus, 
the problem of an unambiguous quantification of the contribution of one ex- 
posure to the disease load on a population in a multifactorial situation under 
the assumption of quasi equal-ranking of factors remains, but in situations 
where, e.g., a specific sequence of exposure factors targeted in a prevention 
campaign is given the SAR can be of intrinsic interest (Rowe et al. (2004)). 
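
The order dependence of the SAR can be made concrete with a toy calculation. The sketch below assumes binary exposures, no additional confounders, and a population described by the proportion and disease risk of each exposure profile; the function names and all numbers are hypothetical and chosen only for illustration.

```python
def ar_after_removal(population, risk, removed):
    """AR for jointly eliminating the exposures with indices in `removed`.

    population: maps exposure profile (tuple of 0/1) -> proportion of subjects
    risk:       maps exposure profile -> P(D = 1 | profile)
    """
    p_d = sum(w * risk[e] for e, w in population.items())
    # Disease probability if the removed exposures were set to "unexposed"
    # while all other exposures keep their observed distribution.
    p_d_removed = sum(
        w * risk[tuple(0 if i in removed else x for i, x in enumerate(e))]
        for e, w in population.items()
    )
    return (p_d - p_d_removed) / p_d


def sequential_ars(population, risk, order):
    """Sequential ARs when exposures are eliminated in the given order."""
    removed, previous, shares = set(), 0.0, {}
    for i in order:
        removed.add(i)
        current = ar_after_removal(population, risk, removed)
        shares[i] = current - previous   # additional effect of exposure i
        previous = current
    return shares


# Hypothetical population with two binary exposures (index 0 and 1).
population = {(0, 0): 0.25, (1, 0): 0.25, (0, 1): 0.25, (1, 1): 0.25}
risk = {(0, 0): 0.1, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.6}
first_then_second = sequential_ars(population, risk, order=(0, 1))
second_then_first = sequential_ars(population, risk, order=(1, 0))
```

In this toy setting the share of exposure 0 is 1/3 when it is removed first but only 1/6 when it is removed second, while the two shares always add up to the joint AR of 2/3.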



5 Partial attributable risk 

As a solution to the problem of ambiguity, the partial AR (PAR) has been 
suggested. The idea was originally proposed in a preliminary form by 
Cox (1987). The PAR is estimated in two steps: 

1. by deriving the joint attributable risk for all exposures $E_1, \ldots, E_L$ under 
consideration, i.e., the AR for at least one of these exposure factors, and 
then 

2. additively partitioning this quantity into shares for each exposure $E_i$ 
using an appropriate allocation rule. 

These resulting shares for $E_i$ are referred to as the “partial attributable risk 
for $E_i$”, $PAR(E_i)$. The development of an appropriate allocation rule has 
been inspired by game theory. A classical problem in game theory is the 
following: how can the (momentary) profit that several players have gained 
by cooperative action in varying coalitions be fairly divided among them? 
In 1953, Shapley developed a set of axioms for profit division which leads 
to a unique solution, the Shapley value. The Shapley value averages the 
contributions of single players to all possible coalitions and is still one of 
the most common methods of payoff allocation in game theory based on the 
following assumptions: 

• Efficiency: Entire value of each coalition must be paid out to members 

• Symmetry: Payoffs must be independent from order of players 

• Additivity: Payoff of the sum of two games must be the sum of two 
separate payoffs 

• Null player: Any player, whose value added to any coalition is 0, receives 
0 payoff in any coalition 



Table 1. Comparison of game-theoretic and epidemiological setting 

Game theory                                   Epidemiology 
Player $P_i$                                  Exposure $E_i$ 
Varying coalitions among                      Combinations of exposures 
$P_1, \ldots, P_L$                            among $E_1, \ldots, E_L$ 
Profit                                        Risk 
Fair division of profit                       Allocation of AR to 
among all players                             each exposure factor 
Shapley value: average of                     Partial AR: average of 
the contribution of a single                  all $L!$ sequential ARs 
player $P_i$ to all coalitions                of an exposure factor $E_i$ 



While the problems of fair profit division among players and of “fair” allocation 
of shares of AR to certain exposure factors bear a striking similarity (Table 1), 
the axioms have to be reformulated. In particular, the axiom of additivity 
with its clear meaning in typical game-theoretic applications with succes- 
sive games has no meaningful counterpart in epidemiological applications. 
Therefore, an algebraically equivalent set of axioms has been derived to be 
applicable in the epidemiological context including the following “properties 
of fairness” (Land and Gefeller (1997)): 

Marginal rationality ensures a consistent comparison of the population 
impact of one exposure factor $E_i$ with respect to separate (sub-)populations 
(denoted by superscripts $I$ and $II$, respectively). If the AR associated with 
this risk factor $E_i$ is higher in subpopulation $I$ than in subpopulation $II$ 
for all combinations with other exposure factors under study, then the 
attributable share allocated to $E_i$ in subpopulation $I$ should be larger than in 
subpopulation $II$. More formally: 

$$AR^{I}(S \cup E_i) - AR^{I}(S) > AR^{II}(S \cup E_i) - AR^{II}(S) \quad \forall S \subseteq \{E_1, \ldots, E_L\} \setminus \{E_i\}$$

$$\Rightarrow \quad PAR^{I}(E_i) > PAR^{II}(E_i)$$

Internal marginal rationality ensures a consistent comparison of different 
exposure factors concerning their respective impact on the disease load in one 
population, i.e., if the AR associated with a certain risk factor $E_i$ is larger 
than the AR associated with another risk factor $E_k$ in all corresponding 
combinations with other exposure factors under study in a given population, 
then $PAR(E_i)$ should also be larger than $PAR(E_k)$. More formally: 

$$AR(S \cup E_i) > AR(S \cup E_k) \quad \forall S \subseteq \{E_1, \ldots, E_L\} \setminus \{E_i, E_k\}$$

$$\Rightarrow \quad PAR(E_i) > PAR(E_k)$$

Symmetry ensures that the method used for dividing up the joint AR among 
the L different exposure factors is not influenced by any ordering among the 
variables. While SARs are not symmetrical, as pointed out above, the PAR 
is symmetrical by virtue of the averaging process. 

Finally, there is exactly one way of partitioning the joint attributable risk for 
L exposure factors into L single components, which then sum up to the joint 
AR, which satisfies both marginal rationality and internal marginal rational- 
ity as well as symmetry (Land and Gefeller (1997)). Recently, extensions of 
the concept of PAR have been introduced to address the situation of grouped 
(hierarchical) exposure variables. In this situation, a “top down” approach of 
first deriving the PARs for the group variables and then further subdividing 
these into shares for each single exposure factor must be followed (Land et 
al. (2001)). 
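
The averaging that defines the PAR can be sketched directly from the Shapley construction: for each of the L! orderings, the sequential AR of every factor is computed, and the results are averaged per factor. The sketch below assumes that the joint AR is already available for every subset of exposure factors; the function name `partial_ars`, the factor labels and the AR values are hypothetical.

```python
from itertools import permutations


def partial_ars(joint_ar, factors):
    """Average the sequential ARs of each factor over all orderings.

    joint_ar: maps frozenset of factors -> AR for eliminating that set
              (the empty set must map to 0.0).
    """
    totals = {f: 0.0 for f in factors}
    orderings = list(permutations(factors))
    for order in orderings:
        removed = frozenset()
        for f in order:
            # Sequential AR of f given the factors removed before it.
            totals[f] += joint_ar[removed | {f}] - joint_ar[removed]
            removed = removed | {f}
    return {f: total / len(orderings) for f, total in totals.items()}


# Hypothetical joint ARs for two exposure factors A and B.
joint_ar = {
    frozenset(): 0.0,
    frozenset({"A"}): 1 / 3,
    frozenset({"B"}): 1 / 2,
    frozenset({"A", "B"}): 2 / 3,
}
shares = partial_ars(joint_ar, ["A", "B"])
```

Here the shares are 1/4 for A and 5/12 for B, and they add up to the joint AR of 2/3 regardless of how the factors are listed. Enumerating all L! orderings is only practical for a small number of factors.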

6 Illustrative example: The G.R.I.P.S. Study 

Data of the G.R.I.P.S. study (Göttingen Risk, Incidence, and Prevalence 
Study), a cohort study with 6029 male industrial workers aged 40 to 60, 
designed to analyze the influence of potential risk factors for myocardial 
infarction (Cremer et al. (1991)), are used to illustrate the different concepts of 
attributable risk. For our purposes, we focus on the effect of the three 
lipoprotein fractions (LDL-, VLDL- and HDL-cholesterol) and cigarette smoking as 
exposures of interest, while controlling for age, familial disposition, 
alcohol consumption, blood pressure and glucose level as potential confounders. 
All analyses were performed with the SAS software package. Table 2 shows 
estimates for crude AR (derived from the corresponding 2 x 2 tables), 
adjusted AR (based on a logistic regression analysis) and partial AR. Note that 
estimates of precision are omitted. 

Table 2. Crude, adjusted and partial AR for exposure factors in G.R.I.P.S. 

Exposure factor                 Definition of     Crude AR   Adjusted AR   Partial AR 
                                “unexposed” 
LDL-cholesterol                 < 160 mg/dl       0.612      0.577         0.396 
HDL-cholesterol                 > 35 mg/dl        0.204      0.172         0.072 
VLDL-cholesterol                < 30 mg/dl        0.217      0.167         0.067 
Smoking                         nonsmoker         0.371      0.370         0.234 
Total effect of all 4 factors                     0.803      0.769         0.769 
(joint AR) 

From the comparison of crude and adjusted values it is evident that some 
confounding of the relationship between 
lipoprotein exposure variables and the outcome is present, while this is not 
the case for cigarette smoking. Moreover, in the situation of the G.R.I.P.S. 
study the PAR for each exposure variable is generally much lower than the 
corresponding crude and adjusted AR. Due to their construction the PARs 
given in Table 2 have an additive property, i.e., the sum of all PAR values 
equals the total effect of all four exposures measured by the corresponding 
adjusted AR, according to expression (6) (adjusted for the set of five other, 
confounding variables quoted above). Consequently, in all situations the sum 
of PAR values cannot exceed the natural limit of one, which must be regarded 
as a strong advantage with respect to the interpretation of this measure in a 
multifactorial setting. 

7 Conclusion 

The estimation of attributable risks from epidemiological data forms an in- 
tegral part of modern analytical approaches quantifying the relationship be- 
tween some binary disease variable and a set of exposure factors. Whereas 
the relative risk quantifies the impact of exposure factors on an individual 
level, the AR addresses the impact on a population level. The multifactorial 
situation usually encountered in epidemiological studies should be reflected in 
the definition of these risk parameters. The definition of partial attributable 
risks incorporates the multifactorial nature of the attribution problem and 
offers a solution to the task of assigning shares for several exposure factors. 
Further methodological research will address interval estimation of the PAR 
to promote its utilization in practical epidemiological studies. 

Abstract. This paper deals with improved measures of statistical accuracy for 
parameter estimates of latent class models. It introduces more precise confidence 
intervals for the parameters of this model, based on parametric and nonparametric 
bootstrap. Moreover, the label-switching problem is discussed and a solution to 
handle it introduced. The results are illustrated using a well-known dataset. 



1 Introduction 

The finite mixture model is formulated as follows. Let $y = (y_1, \ldots, y_n)$ 
denote a sample of size $n$, with $y_i$ a $J$-dimensional vector. Each data point is 
assumed to be a realization of the random variable $Y$ with $S$-component mixture 
probability density function (p.d.f.) 
$f(y_i; \varphi) = \sum_{s=1}^{S} \pi_s f_s(y_i; \theta_s)$, where 
the mixing proportions $\pi_s$ are nonnegative and sum to one, $\theta_s$ denotes the 
parameters of the conditional distribution of component $s$ defined by 
$f_s(y_i; \theta_s)$, and $\varphi = \{\pi_1, \ldots, \pi_{S-1}, \theta_1, \ldots, \theta_S\}$. 
In this paper, we focus on the case where $S$ is fixed. Note that 
$\pi_S = 1 - \sum_{s=1}^{S-1} \pi_s$. The log-likelihood function of the finite 
mixture model (i.i.d. observations) is 
$\ell(\varphi; y) = \sum_{i=1}^{n} \log f(y_i; \varphi)$, which 
is straightforward to maximize by the EM algorithm (Dempster et al. (1977)). 

This paper focuses on the following question: how accurate is the ML 
estimator of $\varphi$? A natural approach to answering this question is the 
bootstrap technique. Bootstrap analysis has been applied in finite mixture 
modeling mainly to quantify the uncertainty of parameters by bootstrap 
standard errors (e.g., de Menezes (1999)). As a result of the difficulties of using 
likelihood ratio tests for testing the number of components of finite 
mixtures, another application is the bootstrapping of the likelihood ratio statistic 
(McLachlan and Peel (2000)). A full computation of bootstrap confidence 
intervals for finite mixture models has not been reported in the literature. In 
this paper we focus on the latent class model. 

The paper is organized as follows: Section 2 gives a short review of the 
bootstrap technique; Section 3 discusses specific aspects of bootstrap when 
applied to finite mixture models; Section 4 illustrates the contributions for 
the latent class model (finite mixture of conditionally independent 
multinomial distributions). Section 5 summarizes main results and needs for further 
research. 

* His research was supported by Fundação para a Ciência e Tecnologia Grant no. 
SFRH/BD/890/2000 (Portugal) and conducted at the University of Groningen 
(Population Research Centre and Faculty of Economics), The Netherlands. I 
would like to thank Jeroen Vermunt and one referee for their helpful comments 
on a previous draft of the manuscript. 



2 Bootstrap analysis 



The bootstrap is a computer-intensive resampling technique introduced by 
Efron (1979) for assessing, among other things, standard errors, biases, and 
confidence intervals in situations where theoretical statistics are difficult to 
obtain. The bootstrap technique is easily stated. Suppose we have a random 
sample $\mathcal{D}$ from an unknown probability distribution $F$, and we want to 
estimate the parameter $\varphi = t(F)$. Let $S(\mathcal{D}, F)$ be a statistic. In order to 
draw inferences, the underlying sampling distribution of $S(\mathcal{D}, F)$ has to be 
known. The bootstrap method estimates $F$ by some estimate $\hat{F}$ based on 
$\mathcal{D}$, giving a sampling distribution based on $S(\mathcal{D}^*, \hat{F})$, where the bootstrap 
sample $\mathcal{D}^* = (y_1^*, y_2^*, \ldots, y_n^*)$ is a random sample of size $n$ drawn from 
$\hat{F}$, and $\hat{\varphi}^* = S(\mathcal{D}^*, \hat{F})$ is a bootstrap replication of $\hat{\varphi}$. The 
bootstrap uses a Monte Carlo evaluation of the properties of $\hat{\varphi}$, repeating 
sampling, say $B$ times, from $\hat{F}$ to approximate the sampling distribution of 
$\hat{\varphi}$. The $B$ samples are obtained using the following cycle: 

1. Draw a bootstrap sample $\mathcal{D}^{(*b)} = \{y_i^{(*b)}, i = 1, \ldots, n\}$, $y_i^{(*b)} \sim \hat{F}$; 

2. Estimate $\hat{\varphi}^{(*b)} = S(\mathcal{D}^{(*b)}, \hat{F})$ by the plug-in principle. 

For an overview of bootstrap methodology, we refer to Efron and Tibshirani 
(1993). The quality of the approximation depends on the value of $B$ and on how 
close $\hat{F}$ is to the distribution $F$. Efron and Tibshirani (1993, 13) suggest that 
typical values of $B$ for computing standard errors are in the range from 50 to 
200. For example, Albanese and Knott (1994) used 100 replications. For 
confidence intervals, typical values of $B$ are > 1000 (Efron (1987)). Because we 
wish to compute more precise confidence intervals for finite mixture models, 
and there is no prior indication of the appropriate number of bootstrap 
samples, we set $B = 5000$. The application of the bootstrap technique depends 
on the estimation of $F$ ($\hat{F}$), which can be parametric or nonparametric. 
Parametric bootstrap assumes a parametric form for $F$ ($\hat{F}_{par}$) and estimates the 
unknown parameters by their sample quantities. That is, one draws $B$ 
samples of size $n$ from the parametric estimate of the function $F$. Nonparametric 
bootstrap estimates $F$ ($\hat{F}_{nonpar}$) by its nonparametric maximum likelihood 
estimate, the empirical distribution function, which puts equal mass $1/n$ at 
each observation. Then, sampling from $\hat{F}$ means sampling with replacement 
from $\mathcal{D}$. In our analyses, we compare results from the nonparametric (NP) 
and parametric (PAR) versions of the bootstrap. 

After generating $B$ bootstrap samples and computing bootstrap (ML) 
estimates $\hat{\varphi}^{(*b)}$, $b = 1, \ldots, B$, standard errors, bias, and confidence 
intervals for $\varphi$ can easily be computed (Efron and Tibshirani (1986)). Bias 
and standard error of $\hat{\varphi}_r$ are given by 

$$bias^*(\hat{\varphi}_r) = \bar{\varphi}_r^{(*)} - \hat{\varphi}_r \quad \text{and} \quad \sigma^*(\hat{\varphi}_r) = \sqrt{\frac{1}{B-1} \sum_{b=1}^{B} \left(\hat{\varphi}_r^{(*b)} - \bar{\varphi}_r^{(*)}\right)^2},$$

respectively, with $\bar{\varphi}_r^{(*)} = \frac{1}{B} \sum_{b=1}^{B} \hat{\varphi}_r^{(*b)}$, where $r$ 
indexes the elements of vector $\varphi$. 
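
The resampling cycle and the bias and standard error formulas above can be sketched for a generic statistic. This is a minimal sketch of the nonparametric version, not the implementation used for the results reported here; the function names, the default number of replications and the toy data are assumptions made for illustration.

```python
import random
from statistics import mean, stdev


def bootstrap_replicates(data, statistic, n_boot=2000, seed=0):
    """Nonparametric bootstrap: resample with replacement, re-estimate."""
    rng = random.Random(seed)
    n = len(data)
    return [
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    ]


def bootstrap_bias_se(data, statistic, n_boot=2000, seed=0):
    """Bootstrap bias and standard error of `statistic` on `data`."""
    reps = bootstrap_replicates(data, statistic, n_boot, seed)
    estimate = statistic(data)
    bias = mean(reps) - estimate   # bootstrap mean minus original estimate
    se = stdev(reps)               # uses the 1 / (B - 1) denominator
    return bias, se


# Toy binary sample (hypothetical), with the sample proportion as statistic.
toy = [1] * 30 + [0] * 70
bias, se = bootstrap_bias_se(toy, lambda xs: sum(xs) / len(xs))
```

For a mixture model, `statistic` would be replaced by a full model fit on each bootstrap sample, which is where the issues of starting values and label switching discussed in the next section arise.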

Standard errors are a crude but useful measure of statistical accuracy, and 
are frequently used to give approximate confidence intervals for an unknown 
parameter $\varphi_r$ given by $\hat{\varphi}_r \pm \hat{\sigma}_r z^{(\alpha)}$, where $\hat{\varphi}_r$ and $\hat{\sigma}_r$ are the ML estimate 
and the estimated standard error of $\varphi_r$ respectively, and $z^{(\alpha)}$ is the $100 \times \alpha$ 
percentile point of a standard normal variate. In our analyses, we compare 
this approximation using the asymptotic standard error (ASE), the 
nonparametric bootstrap standard error (BSE_NP), and the parametric bootstrap 
standard error (BSE_PAR). 

The percentile method takes a direct $(1 - \alpha)100\%$ bootstrap confidence 
interval using the empirical $\alpha/2$- and $(1 - \alpha/2)$-quantiles of the 
bootstrap replicates. The BC$_a$ confidence interval improves precision by 
correcting for bias and nonconstant variance (skewness) and is especially 
important for asymmetric distributions (Efron (1987); Efron and Tibshirani 
(1993)). The confidence interval for $\varphi_r$ with endpoints $\alpha/2$ and $(1 - \alpha/2)$ is 
$\left((\hat{\varphi}_r)_{BC_a}(\alpha/2), (\hat{\varphi}_r)_{BC_a}(1 - \alpha/2)\right)$, with 

$$(\hat{\varphi}_r)_{BC_a}(\alpha) = \hat{G}^{-1}\left(\Phi\left(z_0 + \frac{z_0 + z^{(\alpha)}}{1 - a\,(z_0 + z^{(\alpha)})}\right)\right),$$

where $\hat{G}$ is the bootstrap cumulative distribution function (c.d.f.), $\Phi$ is the 
c.d.f. of the standard normal distribution, and $z_0 = \Phi^{-1}\{\hat{G}(\hat{\varphi}_r)\}$ roughly 
measures the median bias of $\hat{\varphi}_r^*$, i.e., the discrepancy between the median of the 
bootstrap distribution of $\hat{\varphi}_r^*$ and $\hat{\varphi}_r$ (Efron and Tibshirani (1993), 185). The 
acceleration value $a$ (for nonconstant variance) can be estimated using jackknife 
values (for details, see Efron and Tibshirani (1993), 186). Setting $a = 0$ yields 
the BC confidence interval, which corrects for bias only. Confidence intervals by 
the percentile method are simpler to compute from the bootstrap distribution, 
as $\left(\hat{G}^{-1}(\alpha/2), \hat{G}^{-1}(1 - \alpha/2)\right)$, but may be less precise, since they assume 
$z_0 = a = 0$. 
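
Both interval types can be sketched from a list of bootstrap replicates. The following is an illustrative sketch only: the empirical quantile rule, the clamping of the bias-correction fraction and the function names are assumptions, and the acceleration value is passed in rather than estimated by the jackknife.

```python
from statistics import NormalDist


def quantile(sorted_reps, p):
    """Simple empirical quantile, i.e. an inverse of the bootstrap c.d.f."""
    idx = min(len(sorted_reps) - 1, max(0, int(p * len(sorted_reps))))
    return sorted_reps[idx]


def percentile_interval(reps, alpha=0.05):
    s = sorted(reps)
    return quantile(s, alpha / 2), quantile(s, 1 - alpha / 2)


def bca_interval(reps, estimate, accel=0.0, alpha=0.05):
    """BC_a interval; accel = 0 gives the bias-corrected (BC) interval."""
    nd = NormalDist()
    s = sorted(reps)
    b = len(s)
    # Fraction of replicates below the original estimate, kept away from
    # 0 and 1 so that the normal quantile is finite.
    frac = sum(r < estimate for r in s) / b
    frac = min(max(frac, 1 / (b + 1)), b / (b + 1))
    z0 = nd.inv_cdf(frac)

    def endpoint(level):
        z = nd.inv_cdf(level)
        adjusted = nd.cdf(z0 + (z0 + z) / (1 - accel * (z0 + z)))
        return quantile(s, adjusted)

    return endpoint(alpha / 2), endpoint(1 - alpha / 2)
```

When the original estimate sits at the median of the replicates and the acceleration is zero, `z0` is zero and the BC$_a$ endpoints coincide with the percentile endpoints, which mirrors the remark that the percentile method assumes $z_0 = a = 0$.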



3 Bootstrap analysis in finite mixture models 

The complex likelihood function of finite mixture models adds extra diffi- 
culties in implementing the bootstrap method due to local optima and non- 
identifiability. 

For estimating the parameters of the finite mixture model ($\hat{\varphi}^{(*b)}$), one 
needs to use an iterative process. The EM algorithm is an elegant alternative, 
but its success depends on different issues such as starting values. Because 
the original sample $\mathcal{D}$ and the replicated sample $\mathcal{D}^{(*b)}$ may not differ too 
much, McLachlan and Peel (2000) suggest the use of the maximum likelihood 
estimate of $\varphi$ from $\mathcal{D}$ as a starting value. As stopping rule we require an absolute 
difference between two successive log-likelihood values smaller than $10^{-6}$. 

The likelihood function of the finite mixture model is invariant under 
permutations of the $S$ components, i.e., rearrangement of the component 
indices will not change the likelihood (label-switching). In bootstrap analysis, 
as well as in Bayesian analysis by Markov chain Monte Carlo (MCMC) 
techniques, a permutation of the components may occur, resulting in the 
distortion of the distribution of quantities of interest. One way of eliminating 
this non-identifiability is to define a natural order for each bootstrap sample, 
based, for example, on $\pi_1^{(*b)} < \pi_2^{(*b)} < \ldots < \pi_S^{(*b)}$, $b = 1, \ldots, B$, as commonly 
utilized in Bayesian analysis. We refer to this procedure as the Order 
strategy. However, it has been shown for Bayesian analysis that this method, too, 
can distort the results. Stephens (2000) suggests relabeling or reordering the 
classes based on the minimization of a function of the posterior probabilities 

$$\alpha_{is}^{(*b)} = \pi_s^{(*b)} f_s(y_i; \theta_s^{(*b)}) \left[\sum_{h=1}^{S} \pi_h^{(*b)} f_h(y_i; \theta_h^{(*b)})\right]^{-1}.$$

It has been shown that this method performs well in comparison with other 
label-switching methods in the Bayesian setting (Dias and Wedel (2004)). Let 
$\nu^{(*b)}(\hat{\varphi}^{(*b)})$ define a permutation of the parameters for the bootstrap sample $b$, 
and $Q^{(b-1)} = (q_{is}^{(b-1)})$ be the bootstrap estimate of $\alpha = (\alpha_{is})$, based on the 
previous $b - 1$ bootstrap samples. The algorithm is initialized with a small 
number of runs, say $B^*$: $Q^{(0)} = \frac{1}{B^*} \sum_{m=1}^{B^*} \alpha^{(*m)}$. Then, for the $b$th 
bootstrap sample, choose $\nu^{(*b)}$ to minimize the Kullback-Leibler (KL) divergence 
between the posterior probabilities $\alpha_{is}\{\nu^{(*b)}(\hat{\varphi}^{(*b)})\}$ and the estimate of the 
posterior probabilities $Q^{(b-1)}$, and compute $Q^{(b)}$. For computational details, we 
refer to Stephens (2000). In MCMC, as a result of the underlying Markov chain, 
label switching happens less often than in independent situations such as 
bootstrap resampling. Therefore, an initial estimate using a small number of 
$B^*$ bootstrap estimates (without taking into account label switching) may 
not be appropriate, and a better solution is to take $Q^{(0)}$ as the MLE solution. 
This strategy of dealing with the label switching is referred to as KL. 
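
The core relabeling step can be sketched as a search over component permutations. This is a simplified illustration, not the full algorithm of Stephens (2000): it assumes that the reference matrix of posterior probabilities is given (for instance from the MLE solution), it omits the running update of the reference, and it enumerates all S! permutations, which is only feasible for small S. The function names and the toy matrices are hypothetical.

```python
from itertools import permutations
from math import log


def kl_cost(posteriors, reference, perm, eps=1e-12):
    """KL divergence between relabeled posteriors and the reference.

    posteriors, reference: one row per observation, one column per class.
    perm[s] is the bootstrap component that receives label s.
    """
    total = 0.0
    for row, ref_row in zip(posteriors, reference):
        for s, h in enumerate(perm):
            p = row[h]
            if p > 0:
                total += p * log(p / max(ref_row[s], eps))
    return total


def best_relabeling(posteriors, reference):
    """Permutation of component labels that minimizes the KL cost."""
    n_classes = len(reference[0])
    return min(
        permutations(range(n_classes)),
        key=lambda perm: kl_cost(posteriors, reference, perm),
    )


# Toy example: the bootstrap solution has its two labels switched.
reference = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
switched = [[0.15, 0.85], [0.75, 0.25], [0.35, 0.65]]
perm = best_relabeling(switched, reference)
```

In the toy example the selected permutation swaps the two components, after which the bootstrap parameter estimates would be reordered accordingly before being added to the bootstrap distribution.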



4 An application to the latent class model 

Fig. 1. Effect of the label-switching strategy. 

The finite mixture of conditionally independent multinomial distributions, 
also known as a latent class (LC) model, has become a popular technique 
for clustering and subsequent classification of discrete data (Vermunt and 
Magidson (2003)). For binary data, let $Y_j$ have 2 categories, i.e., $y_{ij} \in \{0, 1\}$. 
The latent class model with $S$ latent classes for $y_i$ is defined by the density 
$f_s(y_i; \theta_s) = \prod_{j=1}^{J} \theta_{sj}^{y_{ij}} (1 - \theta_{sj})^{1 - y_{ij}}$, where $\theta_{sj}$ denotes the parameters of the 
conditional distribution of component $s$, $\theta_{sj} = P(Y_{ij} = 1 \mid Z_i = s)$, i.e., 
the probability that observation $i$ belonging to component $s$ has category 1 
(success) in variable $j$. This definition assumes conditional independence of 
the $J$ manifest variables given the latent variable. The estimation of the LC 
model is straightforward by the EM algorithm. 
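
To make the EM estimation of the LC model with binary items concrete, the following sketch alternates between computing posterior class probabilities and updating the class proportions and conditional probabilities. It is a minimal illustration under stated assumptions: the function name, the random starting values, the iteration cap and the toy response patterns are hypothetical, and only the stopping tolerance of $10^{-6}$ on the log-likelihood follows the text.

```python
import random
from math import log, prod


def fit_latent_class(data, n_classes=2, max_iter=500, tol=1e-6, seed=0):
    """EM for a latent class model with binary items (illustrative sketch)."""
    rng = random.Random(seed)
    n, n_items = len(data), len(data[0])
    pi = [1.0 / n_classes] * n_classes
    theta = [[rng.uniform(0.2, 0.8) for _ in range(n_items)]
             for _ in range(n_classes)]
    history = []
    for _ in range(max_iter):
        # E-step: posterior class probabilities and current log-likelihood.
        posteriors, loglik = [], 0.0
        for y in data:
            joint = [
                pi[s] * prod(t if v else 1.0 - t for t, v in zip(theta[s], y))
                for s in range(n_classes)
            ]
            total = sum(joint)
            loglik += log(total)
            posteriors.append([p / total for p in joint])
        history.append(loglik)
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
        # M-step: posterior-weighted proportions.
        for s in range(n_classes):
            weight = sum(p[s] for p in posteriors)
            pi[s] = weight / n
            theta[s] = [
                sum(p[s] * y[j] for p, y in zip(posteriors, data)) / weight
                for j in range(n_items)
            ]
    return pi, theta, history


# Hypothetical response patterns on four binary items.
toy = ([(1, 1, 1, 1)] * 20 + [(1, 1, 1, 0)] * 10 + [(0, 0, 0, 0)] * 30
       + [(0, 1, 0, 0)] * 10 + [(1, 0, 1, 1)] * 5 + [(0, 0, 1, 0)] * 5)
pi_hat, theta_hat, loglik_path = fit_latent_class(toy)
```

The recorded log-likelihood path should never decrease, which is a useful check when such a routine is run inside a bootstrap loop.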

This application uses the well-known Stouffer-Toby dataset, which has 
also been used by others (e.g., Albanese and Knott (1994)). It comprises 
216 observations classified according to whether respondents tend toward 
particularistic or universalistic values when confronted with each of four 
different role conflict situations. We set universalistic values as the reference 
category, so that reported conditional probabilities ($\theta_{sj}$) refer to 
particularistic values. We set $S = 2$ (identified model). 

We started the EM algorithm 10 times with random values of the 
parameters $\varphi$ drawn from the uniform distribution on [0, 1] for each bootstrap 
sample. Compared with starting the EM algorithm from the MLE solution, we 
concluded that differences are very small, and that starting with the MLE 
solution works well for parametric and nonparametric bootstrap. 

Figure 1 depicts the histogram and kernel density estimate of the 
nonparametric bootstrap distribution of $\pi_1$ and $\theta_{12}$ for the order and KL strategies. 
For $\pi_1$, the order strategy truncates the distribution at 0.5, forcing the 
relabeling of the components, whereas the KL strategy relabels the components 
respecting the geometry of the distribution, allowing values above 0.5. In this 
application, the effect of the order strategy is small, since only 0.64% of the 
bootstrap estimates of $\pi_1$ are above 0.5, and the bootstrap estimates of $\pi_1$ by 
both procedures have similar values. However, even when only a very small 
number of bootstrap samples is affected, the effect on other parameter estimates 
can be serious. For example, Figure 1 shows the distribution of the bootstrap 
estimates of $\theta_{12}$. As can be seen, the order strategy in which $\pi_1$ is truncated at 
0.5 creates multimodality in the distribution of $\theta_{12}$. This leads to a serious 
overestimation of the standard error and the confidence interval for $\theta_{12}$. Results 
presented below are, therefore, based on KL relabeling. 

Table 1 reports ML estimates (MLE), the bootstrap mean (BMean), 
bootstrap median (BMedian), and bootstrap bias (Bias) for nonparametric and 
parametric estimates. Though most of the parameter estimates present some 
bias, it is somewhat larger for $\theta_{23}$. Bootstrap means and medians tend to 
be similar, which may indicate symmetry of the bootstrap distributions. 
From the comparison of parametric and nonparametric estimates, we 
conclude that differences are small. 



Table 1. ML estimates, bootstrap mean and median, and bias 

                            BMean            BMedian            Bias 
                  MLE     NP      PAR      NP      PAR      NP       PAR 
Class 1 
  $\pi_1$         0.279   0.295   0.289    0.289   0.285    0.015    0.009 
  $\theta_{11}$   0.007   0.015   0.015    0.006   0.005    0.008    0.008 
  $\theta_{12}$   0.074   0.082   0.079    0.076   0.076    0.009    0.006 
  $\theta_{13}$   0.060   0.070   0.068    0.062   0.062    0.010    0.008 
  $\theta_{14}$   0.231   0.233   0.228    0.237   0.233    0.002   -0.003 
Class 2 
  $\pi_2$         0.721   0.706   0.711    0.711   0.715   -0.015   -0.009 
  $\theta_{21}$   0.286   0.291   0.288    0.289   0.287    0.005    0.002 
  $\theta_{22}$   0.646   0.654   0.652    0.652   0.652    0.007    0.006 
  $\theta_{23}$   0.646   0.679   0.676    0.677   0.675    0.033    0.030 
  $\theta_{24}$   0.868   0.876   0.875    0.876   0.874    0.008    0.007 



Table 2 presents standard errors and the respective 95% confidence intervals.
Note that $\pi_2$ is not a free parameter, so asymptotic results are not defined
for it. The standard errors are, in general, similar, although slight differences
exist, and the relation between them is difficult to generalize.
We observe that for $\theta_{11}$, $\theta_{12}$, and $\theta_{13}$ the normal approximation does not give
accurate results, because the interval is forced to be symmetric close to
the boundary of the parameter space (the lower limits are negative). Another
approximation results from applying the logit transformation to the probability
parameters defined on $[0, 1]$, i.e. $\log[\varphi_r/(1 - \varphi_r)] = \psi_r$, $\psi_r \in (-\infty, \infty)$.
As an attempt to improve this approximation when the bootstrap is applied, one may
transform every bootstrap estimate and the ML estimate to the logit scale, compute
the confidence interval on the logit scale, and finally apply the inverse
transformation. We concluded that the logit scale gives poor results. For example,
for $\pi_1$ the 95% confidence interval is (0.180, 0.700) and (0.201, 0.671) for the
nonparametric and parametric bootstrap, respectively.
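The logit-scale procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the clipping constant `eps` (needed because estimates on the boundary have no finite logit), and the use of the sample standard deviation as the bootstrap standard error are assumptions made for the example.

```python
import math
from statistics import NormalDist, stdev


def logit(p):
    """Map a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Inverse of the logit transformation."""
    return 1.0 / (1.0 + math.exp(-z))


def logit_normal_ci(mle, boot_estimates, level=0.95, eps=1e-6):
    """Normal-approximation interval computed on the logit scale.

    eps is an assumed guard: estimates equal to 0 or 1 are clipped
    before transforming, since their logit is not finite.
    """
    def clip(p):
        return min(max(p, eps), 1.0 - eps)

    z_boot = [logit(clip(p)) for p in boot_estimates]
    se = stdev(z_boot)                        # bootstrap SE on the logit scale
    q = NormalDist().inv_cdf(0.5 + level / 2.0)
    centre = logit(clip(mle))
    # Back-transform both limits; the result always lies inside (0, 1).
    return inv_logit(centre - q * se), inv_logit(centre + q * se)
```

By construction the back-transformed interval respects the parameter space, but it is no longer symmetric around the ML estimate on the probability scale.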



Table 2. Standard errors and 95% confidence intervals

                     Standard error                     Normal approximation
Parameter         Asymp.    NP      PAR      Asymptotic         BSE (NP)           BSE (PAR)
Class 1
$\pi_1$           0.056    0.061   0.046    (0.169, 0.389)     (0.160, 0.398)     (0.190, 0.369)
$\theta_{11}$     0.025    0.021   0.021    (-0.043, 0.057)    (-0.034, 0.047)    (-0.034, 0.047)
$\theta_{12}$     0.064    0.064   0.058    (-0.052, 0.199)    (-0.052, 0.199)    (-0.040, 0.187)
$\theta_{13}$     0.065    0.060   0.056    (-0.067, 0.187)    (-0.057, 0.177)    (-0.050, 0.171)
$\theta_{14}$     0.093    0.100   0.094    (0.049, 0.413)     (0.036, 0.426)     (0.046, 0.416)
Class 2
$\pi_2$           -        0.061   0.046    -                  (0.602, 0.840)     (0.631, 0.810)
$\theta_{21}$     0.039    0.044   0.040    (0.209, 0.363)     (0.201, 0.372)     (0.208, 0.365)
$\theta_{22}$     0.048    0.049   0.049    (0.552, 0.740)     (0.550, 0.742)     (0.551, 0.741)
$\theta_{23}$     0.049    0.052   0.049    (0.550, 0.742)     (0.544, 0.748)     (0.550, 0.742)
$\theta_{24}$     0.038    0.038   0.037    (0.793, 0.942)     (0.793, 0.942)     (0.796, 0.939)

The percentile method and $BC_a$ do not impose the symmetry condition
of the previous approximations and respect the parameter space (Table 3).
The value of $a$ (not shown) is larger in absolute value for $\theta_{11}$, $\theta_{12}$,
$\theta_{13}$, and $\theta_{24}$. The larger skewness of the bootstrap distributions of $\theta_{11}$, $\theta_{12}$,
$\theta_{13}$, and $\theta_{24}$ leads to a larger correction undertaken by the $BC_a$ confidence
interval for these parameters.
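As an illustration of the two interval types, the sketch below computes a percentile interval and a $BC_a$ interval from a list of bootstrap estimates. It assumes the standard formulation in which the bias correction $z_0$ comes from the proportion of bootstrap estimates below the point estimate and the constant $a$ is obtained from jackknife (leave-one-out) estimates; the text does not state how $a$ was computed, so this choice, like all function names, is an assumption. The jackknife estimates are passed in by the caller, since producing them would require refitting the model once per observation.

```python
from statistics import NormalDist

_NORMAL = NormalDist()


def _quantile(sorted_vals, p):
    """Quantile by linear interpolation between order statistics."""
    n = len(sorted_vals)
    pos = p * (n - 1)
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def percentile_ci(boot, level=0.95):
    """Percentile interval: empirical quantiles of the bootstrap estimates."""
    b = sorted(boot)
    alpha = (1.0 - level) / 2.0
    return _quantile(b, alpha), _quantile(b, 1.0 - alpha)


def bca_ci(estimate, boot, jackknife, level=0.95):
    """Bias-corrected and accelerated interval (jackknife acceleration assumed)."""
    b = sorted(boot)
    n_boot = len(b)
    # Bias correction: how far the point estimate is from the bootstrap median.
    prop = sum(1 for v in b if v < estimate) / n_boot
    prop = min(max(prop, 1.0 / (n_boot + 1)), n_boot / (n_boot + 1.0))
    z0 = _NORMAL.inv_cdf(prop)
    # Acceleration: skewness-type statistic of the jackknife estimates.
    mean_j = sum(jackknife) / len(jackknife)
    num = sum((mean_j - v) ** 3 for v in jackknife)
    den = 6.0 * sum((mean_j - v) ** 2 for v in jackknife) ** 1.5
    a = num / den if den > 0 else 0.0

    def adjusted(alpha):
        z = _NORMAL.inv_cdf(alpha)
        return _NORMAL.cdf(z0 + (z0 + z) / (1.0 - a * (z0 + z)))

    alpha = (1.0 - level) / 2.0
    return _quantile(b, adjusted(alpha)), _quantile(b, adjusted(1.0 - alpha))
```

With $z_0 = 0$ and $a = 0$ the adjusted levels reduce to the nominal ones, so the $BC_a$ interval coincides with the percentile interval; skewed bootstrap distributions move the two apart.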



Table 3. Bootstrap 95% confidence intervals

                         Percentile method                       $BC_a$
Parameter          NP                PAR               NP                PAR
Class 1
$\pi_1$           (0.192, 0.430)    (0.210, 0.389)    (0.177, 0.393)    (0.199, 0.371)
$\theta_{11}$     (0.000, 0.071)    (0.000, 0.070)    (0.000, 0.082)    (0.000, 0.083)
$\theta_{12}$     (0.000, 0.225)    (0.000, 0.205)    (0.000, 0.240)    (0.000, 0.211)
$\theta_{13}$     (0.000, 0.207)    (0.000, 0.191)    (0.000, 0.217)    (0.000, 0.201)
$\theta_{14}$     (0.008, 0.426)    (0.015, 0.401)    (0.001, 0.420)    (0.021, 0.403)
Class 2
$\pi_2$           (0.570, 0.808)    (0.611, 0.790)    (0.607, 0.823)    (0.629, 0.801)
$\theta_{21}$     (0.213, 0.383)    (0.212, 0.371)    (0.210, 0.379)    (0.212, 0.371)
$\theta_{22}$     (0.562, 0.755)    (0.561, 0.750)    (0.544, 0.734)    (0.547, 0.735)
$\theta_{23}$     (0.584, 0.787)    (0.582, 0.774)    (0.572, 0.765)    (0.570, 0.761)
$\theta_{24}$     (0.803, 0.951)    (0.804, 0.948)    (0.773, 0.928)    (0.784, 0.930)


5 Conclusion

This paper proposed and described improved measures for estimating the
statistical accuracy of finite mixture model parameters. To our knowledge,
this is the first time that more precise confidence intervals for the latent
class model have been computed, avoiding approximations based on asymptotic
standard errors or on bootstrap standard errors coupled with normal
approximations. Our comparison shows the improvement provided by full bootstrap
confidence intervals, namely the $BC_a$ confidence interval. In the application we
observed similar results for the parametric and nonparametric bootstrap.

Furthermore, we showed that label-switching strategies are needed to han- 
dle the non-identifiability of component labels of finite mixture models. We 
introduced an adaptation of the Stephens method to the bootstrap method- 
ology that alleviates the effect of hard constraints and respects the geometry 
of the bootstrap distributions. 

Future research could extend our findings to other finite mixture models,
such as finite mixtures of generalized linear models.

Abstract. Significant improvement of classification accuracy can be obtained by
aggregation of multiple models. Proposed methods in this field are mostly based
on sampling cases from the training set, or changing weights for cases. Reduction
of classification error can also be achieved by random selection of variables for the
training subsamples or directly for the model. In this paper we propose a method of
feature selection for ensembles that significantly reduces the dimensionality of the
subspaces.



1 Introduction 

Combining classifiers into an ensemble is one of the most interesting recent 
achievements in statistics aiming at improving accuracy of classification. Mul- 
tiple models are built on the basis of training subsets (selected from the train- 
ing set) and combined into an ensemble or a committee. Then the component 
models determine the predicted class. 

Combined classifiers work well if the component models are “weak” and
diverse. The term “weak” refers to poorly performing classifiers that have
high variance and low complexity. The diversity of base classifiers is obtained
by using different training subsets, assigning different weights to instances or
selecting different subsets of features.

Examples of the component classifiers are: classification trees, nearest 
neighbours, and neural nets. 

A number of aggregation methods have been developed so far. Some are
based on sampling cases from the training set, while others use systems of
weights for cases and combined models, or choose variables at random for
the training samples or directly for the model.

Selecting variables for the training subsamples amounts to projecting the cases
into a space of lower dimensionality than the original space. Therefore, reduction
of the number of dimensions of the subspaces is an important problem
in statistics.

In sections 1-4 of this paper we give a review of model aggregation and
feature selection methods for ensembles. Then in section 5 we propose a new
method of feature selection for combined models that significantly reduces the
classification error. Section 6 contains a brief description of related work in
correlation-based feature selection. Results of our experiments are presented
in section 7. The last section contains a short summary.

Fig. 1. The aggregation of classification models.



2 Model aggregation 

Given a set of training instances:

$$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}, \qquad (1)$$

we form a set of training subsets $T_1, T_2, \ldots, T_M$, and a classifier is fitted to
each of them, resulting in a set of base models $C_1, C_2, \ldots, C_M$. Then they
are combined in some way to produce the ensemble $C^*$. When the component
models are tree-based models, the ensemble is called a forest. Figure 1 shows
the process of combining classification models.

Several variants of aggregation methods have been developed that differ 
in two aspects. The first one is the way that the training subsets are formed 
on the basis of the original training set. Generally three approaches are used: 

• Manipulating training examples: Windowing (Quinlan (1993)); Bagging
(Breiman (1996)); Wagging (Bauer and Kohavi (1999)); Boosting (Freund
and Schapire (1997)) and Arcing (Breiman (1998)).

• Manipulating output values: Adaptive bagging (Breiman (1999)); Error- 
correcting output coding (Dietterich and Bakiri (1995)). 

• Manipulating features (predictors): Random subspaces (Ho (1998)); Ran- 
dom split selection (Amit and Geman (1997)), (Dietterich (2000)); Ran- 
dom forests (Breiman (2001)). 

The second aspect is the way that the outputs of base models are combined 
for the aggregate C*(x). There are three methods: 

• Majority voting (Breiman (1996)), when the component classifiers vote
for the most frequent class as the predicted class:

$$C^*(x) = \arg\max_{y \in Y} \sum_{m=1}^{M} I(C_m(x) = y) \qquad (2)$$

• Weighted voting (Freund and Schapire (1997)), where predictions of base
classifiers are weighted. For example, in boosting the classifiers with lower
error rate are given higher weights:

$$C^*(x) = \arg\max_{y \in Y} \sum_{m=1}^{M} a_m I(C_m(x) = y) \qquad (3)$$

where $a_m = \log\left(\frac{1 - e_m}{e_m}\right)$, and $e_m$ is the error rate of the classifier $C_m$.

• Stacked generalisation (Wolpert (1992)), which also uses a system of weights
for component models:

$$C^*(x) = \sum_{m=1}^{M} w_m C_m(x), \qquad (4)$$

where $w_m = \arg\min_{w} \sum_{i=1}^{n} \left\{ y_i - \sum_{m=1}^{M} w_m C_m^{-i}(x_i) \right\}^2$. The models
$C_m^{-i}(x)$ are fitted to training samples $U^{-i}$ obtained by leave-one-out
cross-validation (i.e. with the $i$-th observation removed).
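The first two combination rules above can be sketched in a few lines. The function names are illustrative only, and ties are broken in favour of the class encountered first, which is an assumption the text does not address.

```python
import math
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Equation (3): pick the class with the largest total weight."""
    score = defaultdict(float)
    for label, weight in zip(predictions, weights):
        score[label] += weight
    # max() returns the first label reaching the maximum (assumed tie rule).
    return max(score, key=score.get)


def majority_vote(predictions):
    """Equation (2): weighted voting with all weights equal to one."""
    return weighted_vote(predictions, [1.0] * len(predictions))


def boosting_weight(error_rate):
    """a_m = log((1 - e_m) / e_m); lower error gives a higher weight."""
    return math.log((1.0 - error_rate) / error_rate)
```

For instance, three base models predicting `["a", "b", "a"]` yield `"a"` under majority voting, but if the second model has a much lower error rate than the other two, weighted voting can return `"b"`.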



3 Random Subspace Method 

Ho (1998) introduced a simple aggregation method for classifiers called “Ran- 
dom Subspace Method” (RSM). Each component model in the ensemble is 
fitted to the training subsample containing all cases from the training set but 
with randomly selected features. Varying the feature subsets used to fit the 
component classifiers results in their necessary diversity. 
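A minimal sketch of the procedure just described is given below. The base learner is passed in as a function; the nearest-centroid learner included here is only a stand-in chosen to keep the example self-contained (the experiments in this paper use classification trees), and all names are hypothetical.

```python
import random
from collections import Counter


def random_subspace_ensemble(X, y, fit, n_models, n_features, seed=0):
    """Fit n_models base models, each on all cases but a random feature subset."""
    rng = random.Random(seed)
    n_total = len(X[0])
    ensemble = []
    for _ in range(n_models):
        cols = sorted(rng.sample(range(n_total), n_features))
        X_sub = [[row[c] for c in cols] for row in X]  # keep every case
        ensemble.append((cols, fit(X_sub, y)))
    return ensemble


def predict(ensemble, x):
    """Combine the component predictions by majority voting."""
    votes = Counter(model([x[c] for c in cols]) for cols, model in ensemble)
    return votes.most_common(1)[0][0]


def fit_nearest_centroid(X, y):
    """Stand-in base learner (assumption): classify by the closest class centroid."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    centroids = {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }

    def model(x):
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
        )

    return model
```

Because every component model sees all training cases, diversity comes only from the different feature subsets stored alongside each model.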

This method is very useful, especially when data are highly dimensional,
or some features are redundant, or the training set is small compared to the
data dimensionality. It is also useful when the base classifiers suffer from the
“curse of dimensionality”.

The RSM uses a parallel classification algorithm, in contrast to boosting or
adaptive bagging, which are sequential. It does not require specialised software
or any modification of the source code of existing software.

A disadvantage of the RSM is the problem of finding the optimal number
of dimensions for random subspaces. Ho (1998) proposed to choose half of
the available features, while Breiman (2001) proposed the square root of the
number of features, or twice the root.

Figure 2 shows the classification error for the committee of trees built
for the Satellite dataset (Blake et al. (1998)). The error has been estimated
on the appropriate test set. Note that, as the number of dimensions of the
random subspaces increases, the error first decreases quickly and then starts to rise.

We propose to reduce the dimensionality of the subspaces by applying a
feature selection to the initial set of variables chosen at random.

Fig. 2. Effect of the number of features on classification error rate (horizontal axis: number of features).

4 Feature selection for ensembles 

The aim of feature selection is to find the best subset of variables. There are 
three approaches to feature selection for ensembles: 

• filter methods that filter undesirable features out of the data before clas- 
sification, 

• wrapper methods that use the classification algorithm itself to evaluate 
the usefulness of feature subsets, 

• ranking methods that score individual features. 

Filter methods are the most commonly used methods for feature selection
in statistics. We will focus on them in the next two sections.

The wrapper methods generate sets of features. Then they run the classification
algorithm using the features in each set and evaluate the resulting models
using 10-fold cross-validation. Kohavi and John (1997) proposed a stepwise
wrapper algorithm that starts with an empty set of features and adds single
features that improve the accuracy of the resulting classifier. Unfortunately,
this method is only useful for data sets with a relatively small number
of features and very fast classification algorithms (e.g. trees). In general, the
wrapper methods are computationally expensive and very slow.

The RELIEF algorithm (Kira and Rendell (1992)) is an interesting exam- 
ple of ranking methods for feature selection. It draws instances at random, 
finds their nearest neighbors, and gives higher weights to features that dis- 
criminate the instance from neighbors of different classes. Then those features 
with weights that exceed a user-specified threshold are selected.

5 Proposed method



We propose to reduce the dimensionality of random subspaces using a filter
method based on the Hellwig heuristic. The method is a correlation-based feature
selection and consists of two steps:

1. Iterate m = 1 to M:

• Choose at random half of the data set features ($L/2$) for the training
subset $T_m$.

• Determine the best subset $F_m$ of features in $T_m$ according to
Hellwig's method.

• Grow and prune the tree using the subset $F_m$.

2. Finally, combine the component trees using majority voting.

The heuristic proposed by Hellwig (1969) takes into account both class-feature
correlation and correlation between pairs of variables. The best subset
of features is the one, among all possible subsets $F_1, F_2, \ldots, F_K$ ($K = 2^L - 1$),
that maximises the so-called “integral capacity of information”:

$$H(F_m) = \sum_{j=1}^{L_m} h_{mj}, \qquad (5)$$

where $L_m$ is the number of features in the subset $F_m$ and $h_{mj}$ is the capacity
of information of a single feature $x_j$ in the subset $F_m$:

$$h_{mj} = \frac{r_{cj}^2}{1 + \sum_{i=1,\, i \neq j}^{L_m} |r_{ij}|}. \qquad (6)$$

In equation (6), $r_{cj}$ is a class-feature correlation and $r_{ij}$ is a feature-feature
correlation.
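To make equations (5) and (6) concrete, the following sketch evaluates the integral capacity for every non-empty subset and returns the best one. It assumes the correlations are already available as a list `r_class` (class-feature) and a symmetric matrix `r_feat` (feature-feature); these names and the exhaustive enumeration are illustrative, and the enumeration is only feasible for a small number of features.

```python
from itertools import combinations


def integral_capacity(subset, r_class, r_feat):
    """H(F) from equation (5): sum of the individual capacities h_j of equation (6)."""
    total = 0.0
    for j in subset:
        # Denominator: 1 plus the absolute correlations with the other subset members.
        denom = 1.0 + sum(abs(r_feat[i][j]) for i in subset if i != j)
        total += r_class[j] ** 2 / denom
    return total


def best_hellwig_subset(r_class, r_feat):
    """Search all 2^L - 1 non-empty subsets for the largest integral capacity."""
    n = len(r_class)
    best, best_h = None, -1.0
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            h = integral_capacity(subset, r_class, r_feat)
            if h > best_h:
                best, best_h = subset, h
    return best, best_h
```

In the small example used in the checks, two features that are both well correlated with the class but also strongly correlated with each other are not selected together; the heuristic prefers pairing the stronger one with a weakly informative but uncorrelated feature.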

The correlations $r_{ij}$ are computed using the formula of the symmetrical
uncertainty coefficient (Press et al. (1988)) based on the entropy function $E(x)$:

$$r_{ij} = 2\,\frac{E(x_i) + E(x_j) - E(x_i, x_j)}{E(x_i) + E(x_j)}. \qquad (7)$$

The measure (7) lies between 0 and 1. If the two variables are independent,
then it equals zero, and if they are completely dependent, it equals unity.
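Equation (7) can be evaluated directly from empirical frequencies of discrete (or discretised) variables. The sketch below uses base-2 logarithms, which does not affect the ratio; the function names and the convention of returning zero when both variables are constant are assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical entropy E(x) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """Equation (7): 2 * (E(x) + E(y) - E(x, y)) / (E(x) + E(y))."""
    ex, ey = entropy(x), entropy(y)
    if ex + ey == 0.0:
        return 0.0  # both variables constant; assumed convention
    exy = entropy(list(zip(x, y)))  # joint entropy from paired values
    return 2.0 * (ex + ey - exy) / (ex + ey)
```

Two identical binary variables give a value of one, while two binary variables whose four combinations are equally frequent give zero.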

Continuous features have been discretised using the contextual technique
of Fayyad and Irani (1993).



6 Related work 

Several correlation-based methods of feature selection for ensembles have
been developed so far. We can assign them to one of the following groups:
simple correlation-based selection, advanced correlation-based selection, and
contextual merit-based methods.

Oza and Tumer (1999) proposed a simple method that belongs to the first
group. It ranks the features by their correlations with the class. Then the $L$
features of highest correlation are selected for the model. This approach is not
effective if there is a strong feature interaction (multicollinearity).

The correlation feature selection (CFS) method developed by Hall (2000)
is advanced because it also takes into account correlations between pairs of
features. The set of features $F_m$ selected for the model is the one that
maximizes the value:

$$\mathrm{CFS}(F_m) = \frac{L_m \bar{r}_c}{\sqrt{L_m + L_m (L_m - 1)\,|\bar{r}_{ij}|}}, \qquad (8)$$

where $\bar{r}_c$ is the average feature-class correlation and $\bar{r}_{ij}$ is the average
feature-feature correlation.
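For comparison with the Hellwig criterion, the CFS value of equation (8) can be computed as in the sketch below. The inputs `r_class` and `r_feat` (class-feature correlations and a symmetric feature-feature matrix) are hypothetical names, and averaging over unordered feature pairs is an assumption about how the mean feature-feature correlation is formed.

```python
import math
from itertools import combinations


def cfs_merit(subset, r_class, r_feat):
    """Equation (8): L * mean(r_c) / sqrt(L + L * (L - 1) * |mean(r_ij)|)."""
    size = len(subset)
    mean_rc = sum(r_class[j] for j in subset) / size
    pairs = list(combinations(subset, 2))
    # A single feature has no pairs; its mean feature-feature correlation is taken as 0.
    mean_rff = sum(r_feat[i][j] for i, j in pairs) / len(pairs) if pairs else 0.0
    return size * mean_rc / math.sqrt(size + size * (size - 1) * abs(mean_rff))
```

As with the Hellwig heuristic, adding a feature that is strongly correlated with one already selected increases the denominator and can lower the merit of the subset.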

Hong (1997) developed a method that assigns a merit value to the feature
$x_k$, namely the degree to which the other features are capable of classifying
the same instances as $x_k$. The distance between the examples $x_i$ and $x_j$ in
the set of features $F_m$ is defined as:

$$D_{ij} = \sum_{k=1}^{L_m} d_{ij}^{(k)}. \qquad (9)$$

For a categorical feature $x_k$ the component distance is:

$$d_{ij}^{(k)} = \begin{cases} 0 & \text{if } x_{ki} = x_{kj} \\ 1 & \text{if } x_{ki} \neq x_{kj} \end{cases} \qquad (10)$$

and for a continuous one it is:

$$d_{ij}^{(k)} = \min\left( \frac{|x_{ki} - x_{kj}|}{t_k},\ 1 \right), \qquad (11)$$

where $t_k$ is a feature-dependent threshold (i.e. half of the range).

The contextual merit of the feature $x_k$ is:

$$CM(x_k) = \sum_{i=1}^{N} \sum_{j \in C(i)} w_{ij}\, d_{ij}^{(k)}, \qquad (12)$$

where $w_{ij} = 1/D_{ij}^2$ if $x_j$ is one of the $K$-nearest counter examples to $x_i$ and
$w_{ij} = 0$ otherwise. $C(i)$ in equation (12) is the set of counter examples to $x_i$
(all instances not in the class of $x_i$).



7 Experiments 

To compare prediction accuracy of ensembles for different feature selection
methods we used 9 benchmark datasets from the Machine Learning Repository
at the UCI (Blake et al. (1998)).

Results of the comparisons are presented in Table 1. For each dataset
an aggregated model has been built containing $M = 100$ component trees.
Classification errors have been estimated for the appropriate test sets.



Table 1. Classification errors and dimensionality of random subspaces.

                      Single tree                 New        Average number of
Data set              (Rpart)        CFS          method     features (new method)
DNA                    6.40%         5.20%        4.51%      12.3
Letter                14.00%        10.83%        5.84%       4.4
Satellite             13.80%        14.87%       10.32%       8.2
Soybean                8.00%         9.34%        6.98%       7.2
German credit         29.60%        27.33%       26.92%       5.2
Segmentation           3.70%         3.37%        2.27%       3.4
Sick                   1.30%         2.51%        2.14%       6.7
Anneal                 1.40%         1.22%        1.20%       5.8
Australian credit     14.90%        14.53%       14.10%       4.2

8 Summary 

In this paper we have proposed a new correlation-based feature selection 
method for classifier ensembles that is contextual (uses feature intercorrela- 
tions) and based on the Hellwig heuristic. It gives more accurate aggregated 
models than those built with the CFS correlation-based feature selection 
method. The differences in classification error are statistically significant at
the $\alpha = 0.05$ level (two-tailed t-test).

The presented method also considerably reduces the dimensionality of
random subspaces.

Abstract. We address a current problem in industrial quality control, the detection
of defects in a laser welding process. The process is observed by means of a
high-speed camera, and the task is complicated by the fact that very high sensi- 
tivity is required in spite of a highly dynamic / noisy background and that large 
amounts of data need to be processed online. In a first stage, individual images are 
rated and these results are then aggregated in a second stage to come to an overall 
decision concerning the entire sequence. Classification of individual images is by 
means of a polynomial classifier, and both its parameters and the optimal subset 
of features extracted from the images are optimized jointly in the framework of a 
wrapper optimization. The search for an optimal subset of features is performed 
using a range of different sequential and parallel search strategies including genetic 
algorithms. 



1 Introduction 

Techniques from data mining have gained much importance in industrial ap- 
plications in recent years. The reasons are increasing requirements of quality, 
speed and cost minimization and the automation of high-level tasks previously
performed by human operators, especially in image processing. Since
the data streams acquired by modern sensors grow at least as fast as the
processing power of computers, more efficient algorithms are required in spite
of Moore's law.

The industrial application introduced here is an automated supervision
of a laser welding process. An HDRC (High-Dynamic-Range-CMOS) sensor
records a welding process on an injection valve. It acquires over 1000 frames
per second with a resolution of 64 x 64 pixels. The aim is to detect welding
processes which are characterized by sputter, i.e. the ejection of metal par- 
ticles from the keyhole, see Fig. 1. These events are rare and occur at most 
once in a batch of 1000 valves. Potential follow-up costs of a missed detection 
are high and thus a detection with high sensitivity is imperative, while a 
specificity below 100% is tolerable. 

Fig. 1. Top row, left: original frame (64 x 64) from the laser welding process,
showing a harmless perturbation which should not be detected. Middle: image of the
estimated pixel-wise standard deviations, illustrating in which areas the keyhole is
most dynamic. Right: pixels which exceed the expected deviation from the mean are
marked. Large aggregations of marked pixels are merged into an “object hypothesis”.
Bottom row: as above, but for an original image showing some sputter that should be
detected.

The online handling and processing of the large amounts of raw data is
particularly difficult; an analysis becomes possible if appropriate features are
extracted which can represent the process. Of the large set of all conceivable
features, we should choose the ones that maximize the classification performance
on an entire sequence of images. An exhaustive evaluation of all possible
combinations of both features and classifiers is usually too expensive. On
the other hand, the recognition performance using a manually chosen feature
set is not sufficient in most cases. An intermediate strategy is desired and
proposed here: section 2 introduces a two-stage classification system which is
optimized using the wrapper approach (section 3), while experimental results
are given in section 4.



2 Two-stage classification 

2.1 Motivation 

While the task is to evaluate the entire sequence of images, we have im- 
plemented a divide-and-conquer strategy which focuses on individual images 
first. In particular, we use a very conservative classifier on individual im- 
ages: even if there is only a weak indication of an abnormality, the presumed 
sputter is segmented from the background and stored as an object hypothesis. 
Evidence for a sputter is substantiated only if several such hypotheses appear 
in consecutive frames. 

The advantage of a simple classification in the first stage is the fast evaluation
and adaptation of the classifier. The second stage aggregates classifications
derived from individual images into an overall decision with increased
reliability.

2.2 First stage — object classification 

In the first stage, object hypotheses from single images are extracted and 
classified. 

In particular, an image of pixel-wise means and an image of pixel-wise
standard deviations are computed from the entire sequence. Deviations from
the mean which are larger than a constant (e.g. $\in [2.0, 4.0]$) times the standard
deviation at that pixel are marked as suspicious (Brocke (2002), Hader
(2003)). Sufficiently large agglomerations of suspicious pixels then become an
object hypothesis $O_{t,i}$ with indices for time $t$ and object number $i$. Next,
features such as area, eccentricity, intensity, etc. (Teague (1980)) are computed
for all object hypotheses (see footnote 1). Based on these features, we compute
(see section 2.4) an index $d(O_{t,i}) \in [0, 1]$ for membership of object hypothesis
$O_{t,i}$ in class “sputter”.

2.3 Second stage — image sequence classification 

The first stage leaves us with a number of object hypotheses and their class
membership indices. Sputters appear in more than one consecutive frame,
whereas random fluctuations have less temporal correlation. The second stage
exploits this temporal information by aggregating the membership indices
into a single decision for the entire sequence as follows: for each frame, we
retain only the highest membership index: $d_t := \max_i d(O_{t,i})$. If there is no
hypothesis in a frame, the value is set to 0. The $d_t$ can be aggregated using
a variety of functions. We use a sliding window located at time $t$, and apply
the $\sum$, $\prod$, and $\min$ operators to the indices $d_t, \ldots, d_{t+w-1}$ to obtain aggregating
functions $a_w(t)$. The length of the time window $w$ is arbitrary, but should
be no longer than the shortest sputter event in the training database. The
largest value of the aggregate function then gives the decision index for the
entire sequence,

$$d_{\text{sequence}} = \max_t a_w(t). \qquad (1)$$

If $d_{\text{sequence}}$ exceeds a threshold $\theta$, the entire sequence is classified as defective,
otherwise as faultless. The optimum value for the threshold $\theta$ depends on
the loss function, see section 3.
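The second stage can be illustrated with the short sketch below. It takes, for each frame, the list of membership indices of that frame's object hypotheses; the function names, the string labels, and the requirement that the sequence contains at least $w$ frames are assumptions made for the example.

```python
import math

# The three window operators named in the text: sum, product and minimum.
OPERATORS = {"sum": sum, "prod": math.prod, "min": min}


def frame_indices(hypotheses_per_frame):
    """d_t: highest membership index per frame, 0 if the frame has no hypothesis."""
    return [max(frame, default=0.0) for frame in hypotheses_per_frame]


def sequence_index(d, w, op="min"):
    """Equation (1): largest value of the aggregate a_w(t) over all window positions."""
    aggregate = OPERATORS[op]
    # Assumes len(d) >= w so that at least one full window exists.
    return max(aggregate(d[t:t + w]) for t in range(len(d) - w + 1))


def classify_sequence(hypotheses_per_frame, w, theta, op="min"):
    """Label the whole sequence by comparing the decision index with theta."""
    d = frame_indices(hypotheses_per_frame)
    return "defective" if sequence_index(d, w, op) > theta else "faultless"
```

With the minimum operator and $w = 2$, a single isolated frame with a high index does not trigger a detection, whereas two consecutive frames with high indices do, which reflects the temporal-correlation argument made above.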

2.4 Polynomial classifier 

The choice of the classifier used in the first stage is arbitrary. We use the
polynomial classifier (PC, Schürmann (1996)), which offers a high degree of
flexibility if sufficiently high degrees are used. Since it performs a least-squares
minimization, the optimization problem is convex and its solution unique.
Training is by solving a linear system of equations and is faster than that of
classifiers like multilayer perceptrons or support vector machines (LeCun et
al. (1995)), which is important when the training is performed repeatedly,
as in a wrapper optimization (section 3.1). PCs have essentially only one
free parameter, the polynomial degree.

Footnote 1: This list of features is arbitrarily expandable and previous knowledge
on which (subset of) features are useful is not necessary, see section 3.1.

In the development stage, a tedious manual labeling of image sequences
is required to assemble a training set. Based on an initial training set and
the resultant classifier, further sequences can be investigated. The variance
of predictions for single object hypotheses can be estimated, and those for
which a large variance is found can be assumed to be different from the ones
already in the training set and added to it. In particular, under a number
of assumptions (uncorrelated residuals with zero mean and variance $\sigma^2$) the
variance of a prediction can be estimated by $\sigma^2 x^T (X^T X)^{-1} x$, where $X$ is
the matrix of all explanatory variables (features and monomials formed from
these) for all observations in the training set, and $x$ is the new observation
(Seber and Lee (2003)).



3 System optimization 



As stated above, sensitivity is of utmost importance in our application, while
an imperfect specificity can be afforded. These requirements are met by optimizing
the detection threshold $\theta$ such that the overall cost is minimized. The
losses incurred by missed detections or false positives are given by $L_{NIO,IO}$
and $L_{IO,NIO}$, respectively, with the former much larger than the latter.

It is customary to arrange the loss function in a matrix as shown below:

$$L = \begin{pmatrix} L_{IO,IO} & L_{IO,NIO} \\ L_{NIO,IO} & L_{NIO,NIO} \end{pmatrix}, \qquad L_{IO,IO} = L_{NIO,NIO} = 0, \quad L_{IO,NIO} \ll L_{NIO,IO}$$

The first index gives the true class, the second one the estimated class, with
IO faultless, and NIO defective. The aim is to find a decision function which
minimizes the Bayes risk $r = E\{L\}$. A missed NIO part makes for a large
contribution to the risk $r$.

The generalization error of a given feature subset and classifier is estimated
from the bins that are held out in a $k$-fold cross-validation (CV). 5-
or 10-fold CV is computationally faster than leave-one-out and is a viable
choice in the framework of a wrapper algorithm; moreover, these have performed
well in a study by Breiman and Spector (1992).



3.1 Wrapper approach 

We see great potential in the testing of different feature subsets. In earlier
applications the filter approach (which eliminates highly correlated variables
or selects those that correlate with the response) was the first step in finding
the relevant features. The filter approach attempts to assess the importance
of features from the data alone. In contrast, the wrapper approach selects
features using the induction algorithm as a black box without knowledge of
feature context (Kohavi and John (1997)). The evaluation of a large number
of different subsets of features with a classifier is possible only with
computationally efficient procedures such as the PC. We use the wrapper approach to
simultaneously choose the feature subset, the polynomial degree $G$, the operator
in the aggregation function $a$, the window width $w$ and the threshold $\theta$.
Evaluating a range of polynomial degrees $1, \ldots, G$ is expensive; in section 3.3
we show how PCs with degree $< G$ can be evaluated at little extra cost.

3.2 Search strategies in feature subsets 

The evaluation of all $2^n$ combinations of $n$ individual features is usually
prohibitive. We need smart strategies to get as close as possible to the global
optimum without an exhaustive search. Greedy sequential search strategies
are among the simplest methods, with two principal approaches, sequential
forward selection (SFS) and sequential backward elimination (SBE). SFS
starts with an empty set and iteratively selects from the remaining features
the one which leads to the greatest increase in performance. Conversely, SBE
begins with the complete feature set and iteratively eliminates the feature
that leads to the greatest improvement or smallest loss in performance. Both
SFS and SBE have a reduced complexity of $O(n^2)$. Both heuristics can miss
the global optimum because once a feature is selected/eliminated, it is never
replaced again.
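A minimal sketch of SFS is given below. The `score` argument stands for any performance estimate of a feature subset (for instance the negative cross-validated risk of a classifier trained on that subset); it is a hypothetical callback, and stopping as soon as no remaining feature improves the score is an assumed stopping rule that the description above leaves open.

```python
def sequential_forward_selection(n_features, score):
    """Greedy forward search: add the single best feature until nothing improves."""
    selected = []
    best_score = float("-inf")
    remaining = set(range(n_features))
    while remaining:
        # Evaluate every one-feature extension of the current subset.
        candidate, candidate_score = max(
            ((f, score(selected + [f])) for f in sorted(remaining)),
            key=lambda pair: pair[1],
        )
        if candidate_score <= best_score:
            break  # assumed stopping rule: no extension improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score
```

Each pass evaluates at most $n$ candidate subsets and there are at most $n$ passes, which gives the $O(n^2)$ number of evaluations mentioned above. Once a feature has been appended it is never reconsidered, which is why the search can miss the global optimum.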

A less greedy strategy is required to reach the global optimum. In particu- 
lar, locally suboptimal steps can increase the search range. We use a modified 
BEAM algorithm (Aha and Bankert (1995)) in which not only the best, but 
the q best local steps are stored in a queue and explored systematically. De- 
viating from the original BEAM algorithm, we allow either the adding of an 
unused feature or the exchange of a selected with an unused feature. 

Genetic algorithms (GAs) are another global optimization method; they
represent each feature subset as a member of a population. Individuals can
mutate (add or lose a feature) and mate with others (partly copy each other’s
feature subsets), where the probability of mating increases with the predictive
performance of the individuals / subsets involved. It is thus possible to find
solutions beyond the paths of a greedy sequential search. A disadvantage is
the large number of parameters that need to be adjusted and the suboptimal
performance that can result if the choice is poor.



3.3 Efficiency 

The analysis of the runtime is important to understand the potential of the
PC for speed-up. A naive measure of the computational effort is the total
count of multiplications. Although it is just a “quick and dirty” method
ignoring memory traffic and other overheads, it provides good predictions.

The coefficient matrix $A$ for the PC is obtained by solving

$$E\{x x^T\} \cdot A = E\{x y^T\} \qquad (2)$$

with $x$ a column vector specifying the basis functions (i.e. the monomials
built from the original features) of an individual observation and $y$ a vector
which is $[1\ 0]^T$ for one class and $[0\ 1]^T$ for the other. The expectation values
are also called moment matrices.

The computational effort mainly consists of two steps: estimation of the
moment matrix $E\{x x^T\}$ and its inversion. The former requires $D^2 N$
multiplications, with $N$ the number of observations and $D = \binom{F+G}{G}$ the dimension
of $x$, that is, of the feature space obtained by using all $F$ original features as
well as all monomials thereof up to degree $G$.

In CV, the data is partitioned into $k$ bins; accordingly, the $N \times D$ design
matrix $X$ can be partitioned into $N_i \times D$ matrices $X_i$, with $\sum_i N_i = N$.
The moment matrices are estimated for each bin separately by $X_i^T X_i$. For
the $j$th training in the course of a $k$-fold CV, the required correlation matrix
is obtained from

$$X_{-j}^T X_{-j} = \sum_{i \neq j} X_i^T X_i, \qquad (3)$$

that is, only $D \times D$ matrices are added.

In summary, while the correlation matrices need to be inverted in each
of the k runs in a k-fold CV (requiring a total on the order of $k D^3$
multiplications for a Gauss-Jordan elimination), they are recomputed at the
cost of a few additions or subtractions only once the correlation matrices for
individual bins have been built (requiring a total of $D^2 N$ multiplications).

In addition, once the correlation matrix for a full feature set F and
polynomial degree G has been estimated, all moment matrices for
$F' \subset F$ and $G' \le G$ are obtained by a mere elimination of appropriate
rows and columns.
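
The bookkeeping described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the toy data, and the use of `np.linalg.solve` in place of an explicit Gauss-Jordan inversion are assumptions made for the example.

```python
import numpy as np

def bin_moments(X, Y, bins):
    """Return the per-bin moment matrices X_i^T X_i and X_i^T Y_i."""
    k = int(bins.max()) + 1
    xx = [X[bins == i].T @ X[bins == i] for i in range(k)]
    xy = [X[bins == i].T @ Y[bins == i] for i in range(k)]
    return xx, xy

def cv_coefficients(xx, xy):
    """Solve equation (2) for each training run j, leaving bin j out.

    The common factor 1/N of the expectation estimates cancels on both
    sides, so raw sums of products are used directly.
    """
    xx_all, xy_all = sum(xx), sum(xy)
    # Equation (3): only D x D matrices are added or subtracted here.
    return [np.linalg.solve(xx_all - xx[j], xy_all - xy[j])
            for j in range(len(xx))]

def restrict(M, keep):
    """Moment matrix of a sub-model: keep only the selected rows/columns."""
    return M[np.ix_(keep, keep)]

# Hypothetical toy data: N observations, D basis functions, k bins.
rng = np.random.default_rng(0)
N, D, k = 200, 6, 5
X = rng.normal(size=(N, D))
Y = np.eye(2)[rng.integers(0, 2, size=N)]   # rows are [1 0] or [0 1]
bins = np.arange(N) % k
xx, xy = bin_moments(X, Y, bins)
A_per_run = cv_coefficients(xx, xy)          # one coefficient matrix per run
```

The expensive products over all N observations are formed once in `bin_moments`; each of the k training runs then costs one subtraction of D x D matrices plus one linear solve.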

4 Experimental results 

The system has been tested on a dataset of 633 IO and 150 NIO image
sequences which comprise a total of 5294 object hypotheses that have been
labeled by a human expert. A large part of the IO sequences selected for
training were “difficult” cases with sputter look-alikes. The loss function used
was $L_{IO,NIO} = 1$ and $L_{NIO,IO} = 100$, and generalization performance was
estimated using a single 10-fold CV. A total of 19 features were computed for
each object hypothesis. The four subset selection strategies described were
tested. For the modified BEAM algorithm, the parameter q = 5 and 20
generations were used. The GA ran for 50 generations with 60 individuals
each.
Fig. 2. Each point gives the generalization performance, as estimated by CV, for
a particular subset of features and an optimized classifier. For a given subset, all
classifier parameters such as aggregation function operator and its window width,
degree of polynomial, and threshold $\theta$, were optimized using a grid search.

Results are shown in Fig. 2. SBE works better than SFS on average,
though their best results are similar (r = 0.015 and 0.014). BEAM and GA
offer minor improvements (r = 0.010 and 0.012) only.

Surprisingly, the final optimized system recognizes individual object
hypotheses with a low accuracy: r = 0.341 with
$L_{Non\text{-}Sputter,Sputter} = 10$ and $L_{Sputter,Non\text{-}Sputter} = 1$.
The high performance obtained in the end is entirely due to the temporal
aggregation of evidence from individual frames.




[Plot omitted; horizontal axis: number of features]



Fig. 3. Results obtained when replacing object estimates of class membership in 
individual images with the lower bound of interval estimates, see section 2.4. 



Figure 3 shows the results obtained when the membership index $d(O_{t,i})$
is not given by the object estimate obtained from the PC, but by the lower
bound of an interval estimate to reflect the strongly asymmetric loss function.
Overall classification accuracy is not improved, but the magnitude of the
interval can help identify sequences that ought to be labeled manually
and should be included in future training sets.

5 Conclusion and outlook 

Since the number of objects, N, is typically much larger than the number of
basis functions, D, the most expensive part in training a PC is the
computation of the correlation matrix and not its inversion. Recomputations of
the former can be avoided in the framework of cross-validation, as illustrated
in section 4. For our particular data set, advanced subset selection strategies
did not lead to a much improved performance.

Even though all features computed on object hypotheses were chosen with 
the aim of describing the phenomenon well, the generalization performance 
varies greatly with the particular subset that is chosen in a specific classifier. 
A systematic search for the optimal subset is thus well worthwhile, and is
made possible by the low computational cost of the PC which allows for a 
systematic joint optimization of parameters and feature subset. 
Abstract. Since the introduction of bagging and boosting many new techniques
have been developed within the field of classification via aggregation methods. Most 
of them have in common that the class indicator is treated as a nominal response 
without any structure. Since in many practical situations the class must be consid- 
ered as an ordered categorical variable, it seems worthwhile to take this additional 
information into account. We propose several variants of bagging and boosting, 
which make use of the ordinal structure and it is shown how the predictive power 
might be improved. Comparisons are based not only on misclassification rates but 
also on general distance measures, which reflect the difference between true and 
predicted class. 



1 Introduction 

In statistical classification covariates are used to predict the value of an un- 
observed class variable. Various methods have been proposed and are nicely 
summarized e.g. in Hastie et al. (2001). 

In recent years especially the introduction of aggregation methods like 
bagging (Breiman (1996)) and boosting (Freund (1995), Freund and Schapire 
(1996)) led to spectacular improvements of standard techniques. In all these 
methods a basic discrimination rule is used not only once but in different 
(weighted or unweighted) bootstrap versions of the data set. 

A special problem is how to treat categorical ordered response variables. 
This additional information should be used to improve the accuracy of a 
classification technique. The purpose of our work is to combine aggregation 
methods with ordered class problems. Therefore aggregated classifiers for or- 
dered response categories are developed and compared considering empirical 
data sets. 

2 Aggregating classifiers 

In a classification problem each object is assumed to come from one out of k
classes. Let $L = \{(y_i, x_i), i = 1, \dots, n_L\}$ denote the learning or
training set of observed data, where $y_i \in \{1, \dots, k\}$ denotes the class
and $x_i = (x_{i1}, \dots, x_{ip})$ are associated covariates. Based on these p
characteristics a classifier of the form

$$C(., L): X \to \{1, \dots, k\}, \qquad x \mapsto C(x, L)$$

is built, where $C(x, L)$ is the predicted class for observation x. In the
following three variants of aggregated classifiers, that are used as building
blocks later, are briefly sketched.

Bagging (bootstrap aggregating) uses perturbed versions $L_m$ of the learning
set and aggregates the corresponding predictors by plurality voting, where
the winning class is the one being predicted by the largest number of
predictors:

$$\operatorname{argmax}_j \sum_{m=1}^{M} I(C(x, L_m) = j).$$

The perturbed learning sets of size $n_L$ are formed by drawing at random from
the learning set L. The predictor $C(., L_m)$ is built from the m-th bootstrap
sample.

In boosting the data are resampled adaptively and the predictors are
aggregated by weighted voting. The Discrete AdaBoost procedure starts with
weights $w_1 = \dots = w_{n_L} = 1/n_L$, which form the resampling
probabilities. Based on these probabilities the learning set $L_m$ is sampled
from L with replacement and the classifier $C(., L_m)$ is built. The learning
set is run through this classifier yielding error indicators $\epsilon_i = 1$
if the i-th observation is classified incorrectly and $\epsilon_i = 0$
otherwise. With

$$\epsilon_m = \sum_{i=1}^{n_L} w_i \epsilon_i \quad \text{and} \quad c_m = \log \frac{1 - \epsilon_m}{\epsilon_m}$$

the resampling weights are updated for the next step by

$$w_{i,new} = \frac{w_i \exp(c_m \epsilon_i)}{\sum_{j=1}^{n_L} w_j \exp(c_m \epsilon_j)}.$$

After M steps the aggregated voting for an observation x is obtained by

$$\operatorname{argmax}_j \sum_{m=1}^{M} c_m I(C(x, L_m) = j).$$

Real AdaBoost (Friedman et al. (2000)) uses real valued classifier functions
$f(x, L)$ instead of $C(x, L)$. Since the original algorithm only works for two
class problems, we present a variant that can be used for more than two classes
in the following: Again it starts with the weights
$w_1 = \dots = w_{n_L} = 1/n_L$ which form the resampling probabilities. The
learning set is run through a classifier that yields class probabilities

$$p_j(x_i) = P(y_i = j \mid x_i).$$

Based on these probabilities real valued scores $f_j(x_i, L_m)$ are built by

$$f_j(x_i, L_m) = 0.5 \cdot \log \frac{p_j(x_i)}{\ldots}$$

and the weights are updated for the next step by

$$w_{i,new} = \frac{w_i \exp(-f_{y_i}(x_i, L_m))}{\sum_{j=1}^{n_L} w_j \exp(-f_{y_j}(x_j, L_m))}.$$

After M steps the aggregated voting for observation x is obtained by

$$\operatorname{argmax}_j \sum_{m=1}^{M} f_j(x, L_m).$$

Both AdaBoost algorithms are based on weighted resampling. In alternative
versions of boosting observations are not resampled but the classifiers
are computed by weighting the original observations by weights
$w_1, \dots, w_{n_L}$ that are updated iteratively. Then $C(., L_m)$ should be
read as the classifier based on the current weights in the m-th step.

3 Ordinal prediction 

In the following it is assumed that the classes in $y \in \{1, \dots, k\}$ are
ordered. In Fixed split boosting the classification procedure is divided into
two stages: First aggregation is done by splitting the response categories.
Then the resulting binary classifiers are combined. It works by defining

$$y^{(r)} = \begin{cases} 1, & y \in \{1, \dots, r\} \\ 2, & y \in \{r+1, \dots, k\} \end{cases}$$

for r = 1, ..., k - 1.

Let $C^{(r)}(., L)$ denote the classifier for the binary class problem defined
by $y^{(r)}$. For fixed r, by using any form of aggregate classifier one obtains
the predicted class for observation x by computing

$$C_{agg}^{(r)}(x) = \operatorname{argmax}_j \sum_{m=1}^{M} c_m I(C^{(r)}(x, L_m) = j).$$

These first stage aggregate classifiers $C_{agg}^{(r)}(.)$ have been designed
for fixed split at r. The combination of
$C_{agg}^{(1)}(.), \dots, C_{agg}^{(k-1)}(.)$ is based on the second stage
aggregation, now by exploiting the ordering of the categories. Thereby let the
result of the classifier $C_{agg}^{(r)}$ be transformed into the sequence
$y_1^{(r)}, \dots, y_k^{(r)}$ of binary variables.
For $C_{agg}^{(r)}(x) = 1$, corresponding to $\hat{y}(x) \in \{1, \dots, r\}$,
one has

$$y_1^{(r)}(x) = \dots = y_r^{(r)}(x) = 1, \quad y_{r+1}^{(r)}(x) = \dots = y_k^{(r)}(x) = 0.$$

For $C_{agg}^{(r)}(x) = 2$, corresponding to
$\hat{y}(x) \in \{r+1, \dots, k\}$, one has

$$y_1^{(r)}(x) = \dots = y_r^{(r)}(x) = 0, \quad y_{r+1}^{(r)}(x) = \dots = y_k^{(r)}(x) = 1.$$

Thus the classifier $C_{agg}^{(r)}(.)$ yields the binary sequence

$$(\underbrace{1, 1, \dots, 1}_{r}, \underbrace{0, 0, \dots, 0}_{k-r}) \quad \text{or} \quad (\underbrace{0, 0, \dots, 0}_{r}, \underbrace{1, 1, \dots, 1}_{k-r})$$



where the change from 1 to 0 or 0 to 1 is after the r-th component. We divide
these sequences by r or k - r respectively to take into account the different
number of categories within each dichotomization. The final classifier is given
by the second stage aggregation

$$\operatorname{argmax}_j \sum_{r=1}^{k-1} y_j^{(r)}(x).$$
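
As a small illustration of the second stage aggregation, the following sketch turns the k - 1 first stage decisions into scaled binary sequences and sums them. The function names and the example input are hypothetical; the first stage decisions are simply passed in as a list with entries 1 or 2.

```python
import numpy as np

def split_sequence(pred, r, k):
    """Scaled binary sequence for the split at r (1-based).

    pred == 1 means "class in {1..r}", pred == 2 means "class in {r+1..k}".
    The ones are divided by r or k - r, as described in the text.
    """
    seq = np.zeros(k)
    if pred == 1:
        seq[:r] = 1.0 / r
    else:
        seq[r:] = 1.0 / (k - r)
    return seq

def second_stage(preds, k):
    """preds[r-1] is the first stage decision for split r = 1, ..., k-1."""
    scores = sum(split_sequence(p, r, k) for r, p in enumerate(preds, start=1))
    return int(np.argmax(scores)) + 1, scores

# Example with k = 4: splits 1 and 2 vote "higher", split 3 votes "lower".
label, scores = second_stage([2, 2, 1], k=4)
```

For this input the three sequences are (0, 1/3, 1/3, 1/3), (0, 0, 1/2, 1/2) and (1/3, 1/3, 1/3, 0), so class 3 receives the largest summed score.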

In Fixed split boosting the ordinal structure of the response is not used in 
the reweighting scheme. Only in the final combining step it is exploited that 
the response is ordinal. Therefore in the following an alternative algorithm 
(called Ordinal Discrete AdaBoost) is suggested which connects the weights 
in Discrete AdaBoost to the ordered performance of the classifier. 

Again we start with weights $w_1, \dots, w_{n_L}$ which form the resampling
probabilities. Based on these probabilities the learning set $L_m$ is sampled
from L with replacement. Based on $L_m$ the classifiers $C^{(r)}(., L_m)$ are
built for all dichotomous splits of the ordinal class variable at value r. The
learning set is run through each classifier $C^{(r)}(., L_m)$ yielding the
information if the i-th observation is predicted into a class higher or lower
than r. The results of the classifiers for different split values r are
combined by majority vote into the aggregated classifier $C(., L_m)$.

Let the error indicators now be given by

$$\epsilon_i = \frac{|C(x_i, L_m) - y_i|}{k - 1}.$$

Therefore with $\epsilon_m = \sum_{i=1}^{n_L} w_i \epsilon_i$ and
$c_m = \log((1 - \epsilon_m)/\epsilon_m)$ the weights are updated by

$$w_{i,new} = \frac{w_i \exp(c_m \epsilon_i)}{\sum_{j=1}^{n_L} w_j \exp(c_m \epsilon_j)}.$$

After M steps the aggregated voting for observation x is obtained by

$$\operatorname{argmax}_j \sum_{m=1}^{M} c_m I(C(x, L_m) = j).$$
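
One reweighting step of Ordinal Discrete AdaBoost can be written down directly from these formulas. The sketch below is only an illustration: the function name and the toy labels are assumptions, and the degenerate case of a weighted error of exactly 0 is not handled.

```python
import numpy as np

def ordinal_weight_update(w, y_true, y_pred, k):
    """One weight update with distance-based error indicators.

    eps_i = |C(x_i, L_m) - y_i| / (k - 1) lies in [0, 1]; a prediction that
    misses by many classes counts as a larger error than a near miss.
    """
    eps_i = np.abs(y_pred - y_true) / (k - 1)
    eps_m = np.sum(w * eps_i)             # weighted error of this step
    c_m = np.log((1 - eps_m) / eps_m)     # voting weight of the classifier
    w_new = w * np.exp(c_m * eps_i)
    return w_new / w_new.sum(), c_m

# Toy example with k = 4 classes and four observations.
w = np.full(4, 0.25)
y_true = np.array([1, 2, 3, 4])
y_pred = np.array([1, 2, 4, 1])           # one near miss, one far miss
w_new, c_m = ordinal_weight_update(w, y_true, y_pred, k=4)
```

In contrast to Discrete AdaBoost, where every misclassified observation is upweighted by the same factor, the far miss here gains more weight than the near miss.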
In a similar way an ordinal version of Real AdaBoost (called Ordinal
Real AdaBoost) can be developed: For each dichotomization probabilities
$p^{(r)}(x) = P(y \le r \mid x)$ are provided by a dichotomous classifier. From
these probabilities one obtains a sequence of scores $y_j^{(r)}(x)$ for each
class by

$$y_1^{(r)}(x) = \dots = y_r^{(r)}(x) = p^{(r)}(x), \quad y_{r+1}^{(r)}(x) = \dots = y_k^{(r)}(x) = 1 - p^{(r)}(x).$$

For the aggregation across splits one considers the value

$$y_j(x) = \sum_{r=1}^{k-1} y_j^{(r)}(x)$$



which reflects the strength of prediction in class j. Then an ordinal algorithm,
that still shows a close relationship to Real AdaBoost, works as follows: Based
on weights $w_1, \dots, w_{n_L}$ a classifier $y_j(.)$ for ordered classes
1, ..., k is built and the learning set is run through it yielding scores
$y_j(x_i)$, j = 1, ..., k. Based on these scores real valued terms
$f_j(x_i, L_m)$ are built by

$$f_j(x_i, L_m) = 0.5 \cdot \log \frac{y_j(x_i)}{\sum_{l \neq j} d_{il}\, y_l(x_i)}$$

where $d_{ij}$ is the distance between true class and class j for observation i.
So class probabilities are weighted according to the distance between current
and true class. The weights are updated for the next step by

$$w_{i,new} = \frac{w_i \exp(-f_{y_i}(x_i, L_m))}{\sum_{j=1}^{n_L} w_j \exp(-f_{y_j}(x_j, L_m))}.$$

After M steps the aggregated voting for observation x is obtained by

$$\operatorname{argmax}_j \sum_{m=1}^{M} f_j(x, L_m).$$

In Friedman et al. (2000) a variant of Real AdaBoost, called Gentle
AdaBoost, is suggested, which uses a different update function and seems
to be more stable. This algorithm is not described in detail here; the
results of its ordinal adaptation, constructed in the same way as for Real
AdaBoost, are presented in the empirical part.

Finally a boosting variant is presented, that originally was developed to
predict binary classes or real valued variables, but is easily transformed to
cope with ordinal classes: L2-Boost (Bühlmann and Yu (2002)), a special case
of the more general gradient descent boosting algorithm, works without any
kind of weighting. In the first step a real-valued initial learner
$F_0(x) = f(x)$ is computed by means of least squares,
$\min \sum_{i=1}^{n_L} (y_i - f(x_i))^2$. Then the iteration starts with m = 0.
The negative gradient vector

$$u_i = y_i - F_m(x_i)$$

is computed and the real-valued learner $f_{m+1}(x)$ is now fit to these current
residuals, again by means of least squares,
$\min \sum_{i=1}^{n_L} (u_i - f_{m+1}(x_i))^2$. Finally the prediction
$F_{m+1}(x)$ is updated by

$$F_{m+1}(x) = F_m(x) + f_{m+1}(x)$$

and the iteration index m is increased by one.

As the mean squared error is a sensitive indicator for ordinal distances,
this algorithm can be used for ordinal classification with only one little
adjustment: The values of $F_{m+1}(x)$ are rounded to the nearest class label
in order to follow the allowed domain of y.
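
The iteration can be illustrated with a deliberately simple base learner. The sketch below uses a least squares regression stump on a single covariate, which is an assumption made for brevity (the study itself uses trees); all function names and the toy data are hypothetical.

```python
import numpy as np

def fit_stump(x, u):
    """Least squares regression stump on one covariate.

    Assumes x contains at least two distinct values.
    """
    best = None
    for t in np.unique(x)[:-1]:           # candidate split points
        left = x <= t
        lo, hi = u[left].mean(), u[~left].mean()
        sse = np.sum((u - np.where(left, lo, hi)) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, lo, hi)
    _, t, lo, hi = best
    return lambda z: np.where(z <= t, lo, hi)

def l2_boost(x, y, steps):
    """Return F_steps as a function; F_0 is the initial least squares fit."""
    learners = [fit_stump(x, y)]
    F = learners[0](x)
    for _ in range(steps):
        u = y - F                         # negative gradient = residuals
        f_next = fit_stump(x, u)          # fit the learner to the residuals
        learners.append(f_next)
        F = F + f_next(x)                 # F_{m+1} = F_m + f_{m+1}
    return lambda z: sum(f(z) for f in learners)

def predict_class(F, z, k):
    """Round the real-valued prediction to the nearest class label in 1..k."""
    return np.clip(np.rint(F(z)), 1, k).astype(int)

# Toy ordinal data with k = 4 classes.
x = np.arange(1.0, 9.0)
y = np.array([1, 1, 2, 2, 3, 3, 4, 4], dtype=float)
F0, F20 = l2_boost(x, y, steps=0), l2_boost(x, y, steps=20)
mse0, mse20 = np.mean((y - F0(x)) ** 2), np.mean((y - F20(x)) ** 2)
labels = predict_class(F20, x, k=4)
```

Because every learner is fitted by least squares to the current residuals, the training mean squared error cannot increase from one step to the next; the clipping in `predict_class` keeps rounded values inside the allowed label range.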

4 Empirical studies 

The scapula data are part of a dissertation (Feistl and Penning (2001))
written at the Institut für Rechtsmedizin der LMU München. The aim was to
predict the age of dead bodies only by means of the scapula. Therefore many
measurements, including angles, lengths, descriptions of the surface, etc.,
were provided. We preselected 15 important covariates to predict age, which was
split into 8 distinct ordinal classes. Each class covers ten years. The data
set consists of 153 complete observations.

In the following we compare the different ordinal approaches to simpler
alternative methods, that either do not use the ordinal information within
the data or do not use aggregation techniques. The first one is a simple
classification tree, called nominal CART. Here a tree is built by means of the
deviance criterion and grown up to maximal depth. Afterwards it is pruned
on the basis of resubstitution misclassification rates until a fixed tree size
(given by the user and dependent on the data set) is reached. An alternative
approach which uses the ordering of the classes, but still without bagging
or boosting, is to build a tree for every dichotomization in r separately and
aggregate them according to
$\operatorname{argmax}_j \sum_{r=1}^{k-1} I(C^{(r)}(x, L) = j)$. This method
is called ordinal CART. In the same way two bagging variants are considered:
A nominal approach, where every tree predicts the multi class target variable
and the final result is obtained by a majority vote of these predictions, and
ordinal bagging, in which bagging is applied to fixed splits. The results are
aggregated over dichotomizations according to
$\operatorname{argmax}_j \sum_{r=1}^{k-1} I(C^{(r)}(x, L) = j)$
and over the bagging cycles. The nominal methods are considered as baseline
for possible improvements by ordinal bagging or ordinal boosting.

In addition we consider ten different boosting versions: The simple nomi- 
nal Discrete, Real and Gentle AdaBoost, which do not use the ordinal struc- 
ture within the data, are used for comparison only. The new methods are the 
ordinal boosting techniques: On the one hand we distinguish between real, 
discrete and gentle methods, on the other hand between Fixed split boosting 
and Ordinal AdaBoost. Finally L2-Boost results are shown.
The evaluation of the methods is based on several measures of accuracy.
As a raw criterion for the accuracy of the prediction we take the
misclassification error rate $\frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$.
But in the case of ordinal class structure measures should take into account
that a larger distance is a more severe error than a wrong classification into
a neighbour class. Therefore we use the mean absolute value (here called
mean abs) of the differences $\frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
and the mean squared difference (here called mean squ)
$\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$,
which penalizes larger differences even harder.
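
The three measures are straightforward to compute; a short sketch follows, with a hypothetical function name and made-up labels used only to show how the criteria differ.

```python
import numpy as np

def ordinal_errors(y_true, y_pred):
    """Misclassification rate, mean absolute and mean squared difference."""
    diff = np.asarray(y_pred) - np.asarray(y_true)
    return {
        "misclass": float(np.mean(diff != 0)),
        "mean abs": float(np.mean(np.abs(diff))),
        "mean squ": float(np.mean(diff ** 2)),
    }

# Two wrong predictions: one neighbour class, one three classes away.
errors = ordinal_errors([1, 2, 3, 4], [1, 3, 3, 1])
```

Both errors count equally in the misclassification rate (2/4 = 0.5), while the far miss dominates mean abs ((1 + 3)/4 = 1.0) and even more so mean squ ((1 + 9)/4 = 2.5).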

When measuring accuracy one has to distinguish between resubstitution 
error and validation (or test) error. Resubstitution error is used to examine 
how fast the misclassification error can be lowered by the different techniques, 
but because of its bias it is not an appropriate measure for prediction
accuracy. Therefore we divide the data set at random into two parts consisting
of one third and two thirds of the observations, respectively. From the larger
(learning)
data set the classification model is built and the observations of the smaller 
(test) data set are predicted. We use 50 different random splits into learning 
and test set and give the mean over these splits. As testing for differences 
between the performance of various techniques is quite difficult because of 
the statistical dependence between the different test sets, we omit it in the 
framework of this study. 

When aggregating classifiers one has to choose the number of cycles, which 
means the number of different classifiers that are combined in one bagging or 
boosting run. As standard we use a number of 50 cycles in this study. The last 
parameter that has to be chosen is the (fixed) tree size, that is defined as the 
number of terminal nodes of each tree. Here we use a number of 15 terminal 
nodes in the nominal approach, which seems necessary for a problem with 8 
classes and after all 15 covariates. All ordinal approaches are performed with 
trees of size 5, because as far as trees are concerned only two class problems 
are treated. 

The interpretation of Table 1 leads to the following conclusions: As far 
as the misclassification error is concerned there are only slight differences 
between the classifiers. However, the more important measures for problems 
with ordinal classes are the distance measures. Here the results of CART are 
improved by all aggregation methods. Especially Fixed split boosting, but 
also the other ordinal boosting methods and ordinal bagging perform very 
well. For example the mean squared distance 2.365 of the classification tree 
is reduced to 1.215 by Discrete Fixed split boosting. 



Table 1. Test error for scapula data

method                          misclass   mean abs   mean squ
nominal CART                       0.676      1.085      2.365
ordinal CART                       0.652      0.995      2.112
nominal bagging                    0.663      0.982      1.925
ordinal bagging                    0.628      0.828      1.375
Discrete AdaBoost                  0.649      0.932      1.747
Real AdaBoost                      0.646      0.904      1.619
Gentle AdaBoost                    0.643      0.876      1.502
Discrete Fixed split boosting      0.629      0.799      1.215
Real Fixed split boosting          0.646      0.818      1.244
Gentle Fixed split boosting        0.638      0.808      1.216
Ordinal Discrete AdaBoost          0.611      0.841      1.473
Ordinal Real AdaBoost              0.691      0.878      1.330
Ordinal Gentle AdaBoost            0.644      0.806      1.210
L2-Boost                           0.652      0.851      1.316



5 Concluding remarks

The concept to combine aggregating classifiers with techniques for ordinal
data structure led to new methods that can be compared with common
classification techniques. In further studies (Tutz and Hechenbichler (2003))
we found promising results for other empirical data sets as well. Ordinal
techniques definitely improve the performance of a simple CART tree.

All in all there seems to be no dominating method, as for different data
sets the best results are achieved by different methods. Although ordinal
bagging shows satisfying results for all data sets, it is often outperformed
by at least one of the ordinal boosting techniques. These findings suggest
that further research is worthwhile.
Abstract. Cluster validation is necessary because the clusters resulting from
cluster analysis algorithms are not in general meaningful patterns. I propose a
methodology to explore two aspects of a cluster found by any cluster analysis
method: the
ology to explore two aspects of a cluster found by any cluster analysis method: the 
cluster should be separated from the rest of the data, and the points of the cluster 
should not split up into further separated subclasses. Both aspects can be visually 
assessed by linear projections of the data onto the two-dimensional Euclidean space. 
Optimal separation of the cluster in such a projection can be attained by asym- 
metric weighted coordinates (Hennig (2002)). Heterogeneity can be explored by the 
use of projection pursuit indexes as defined in Cook, Buja and Cabrera (1993). The 
projection methods can be combined with splitting up the data set into clustering 
data and validation data. A data example is given. 



1 Introduction 

Cluster validation is the assessment of the quality and the meaningfulness 
of the outcome of a cluster analysis (CA). Most CA methods generate a 
clustering in all data sets, whether there is a meaningful structure or not. 
Furthermore, most CA methods partition the data set into subsets of a more 
or less similar shape, and this may be adequate only for parts of the data, 
but not for all. Often, different CA methods generate different clusterings 
on the same data and it has to be decided which one is the best, if any. 
Therefore, if an interpretation of a cluster as a meaningful pattern is desired, 
the cluster should be validated by information other than the output of the 
CA. A lot of more or less formal methods for cluster validation are proposed 
in the literature, many of which are discussed, e.g., in Gordon (1999, Section 
7.2) and Halkidi et al. (2002). Six basic principles for cluster validation can 
be distinguished: 

Use of external information. External information is information that
has not been used to generate the clustering. Such information can stem
from additional data or from background knowledge. However, such
information is often not available.

Significance tests for structure. Significance tests against null models
formalizing “no clustering structure at all” are often used to justify the
interpretation of a clustering. While the rejection of homogeneity is a
reasonable minimum requirement for a clustering, such tests cannot validate
the concrete structure found by the CA algorithm.

Comparison of different clusterings on the same data. Often, the
agreement of clusterings based on different methods is taken as a
confirmation of clusters. This is only meaningful if sufficiently different CA
methods have been chosen, and in the case of disagreement it could be
argued that not all of them are adequate for the data at hand.

Validation indexes. In some sense, the use of validation indexes is similar to
that of different clusterings, because many CA methods optimize indexes
that could otherwise be used for validation.

Stability assessment. The stability of clusters can be assessed by
techniques such as bootstrap, cross-validation, point deletion, and addition
of contamination.

Visual inspection. Recently (see, e.g., Ng and Huang (2002)), it has been
recognized that all formal approaches of cluster validation have
limitations due to the complexity of the CA problem and the intuitive nature
of what is called a “cluster”. Such a task calls for a more subjective
and visual approach. To my knowledge, the approach of Ng and Huang
(2002) is the first visual technique which is specifically developed for the
validation of a clustering.

Note that these principles address different aspects of the validation problem. 
A clustering that is well interpretable in the light of external information will 
not necessarily be reproduced by a different clustering method. Structural 
aspects such as homogeneity of the single clusters and heterogeneity between 
different clusters as indicated by validation indexes or visual inspection are 
not necessarily properties of clusters which are stable under resampling. How- 
ever, these aspects are not “orthogonal”. A well chosen clustering method 
should tend to reproduce well separated homogeneous clusters even if the 
data set is modified. 

In the present paper, a new method for visual cluster validation is pro- 
posed. As opposed to the approach of Ng and Huang (2002), the aim of the 
present method is to assess every cluster individually. The underlying idea is 
that a valid cluster should have two properties: 

• separation from the rest of the data, so that it should not be joined with 
other parts of the data, 

• homogeneity, so that the points of the cluster can be said to “belong 
together” . 

In Section 2, asymmetric weighted coordinates (AWCs) are introduced. AWCs 
provide a linear projection of the data in order to separate the cluster un- 
der study optimally from the rest of the data. In Section 3, I propose the 
application of some projection pursuit indexes to the points of the cluster 
to explore its heterogeneity. Additionally, if there is enough data to split the 
data set into a “training sample” and a “validation sample”, the projections 
obtained from clustering and visual validation on the training sample can be 
applied also to the points of the validation sample to see if the found pat- 
terns can be reproduced. Throughout the paper, the data is assumed to come 
from the p-dimensional Euclidean space. The methods can also be applied
to distance data after carrying out an appropriate multidimensional scaling 
method. Euclidean data is only needed for the validation; the clustering can 
be done on the original distances. In Section 4, the method is applied to a 
real data set. 

2 Optimal projection for separation 

The most widespread linear projection technique to separate classes goes back
to Rao (1952) and is often called “discriminant coordinates” (DCs). The first
DC is defined by maximizing the ratio

$$F(c) = \frac{c' B c}{c' W c}.$$

Here n denotes the number of points, $n_i$ is the number of points of class i,
s denotes the number of classes, $x_{i1}, \dots, x_{in_i}$ are the
p-dimensional points of class i, $m_i$ is the mean vector of class i and m is
the overall mean. The further DCs maximize F under the constraint of
orthogonality to the previous DCs w.r.t. W. B is a covariance matrix for the
class means and W is a pooled within-class covariance matrix. Thus, F gets
large for projections that separate the means of the classes as far as possible
from each other while keeping the projected within-class variation small. Some
disadvantages limit the use of DCs for cluster validation. Firstly, separation
is formalized only in terms of the class means, and points of different classes
far from their class means need not be well separated (note that the method of
Ng and Huang (2002) also aims at separating the cluster centroids). Secondly,
s - 1 dimensions are needed to display all information about the separation
of s classes, and therefore there is no guarantee that the best separation of
a particular cluster shows up in the first two dimensions in case of s > 3.
This could in principle be handled by declaring the particular cluster to be
validated as class 1 and the union of all other clusters as class 2 (this will
be called the “asymmetry principle” below). But thirdly, DCs assume that
the covariance matrices of the classes are equal, because otherwise W would
not be an adequate covariance matrix estimator for a single class. If the
asymmetry principle is applied to a clustering with s > 2, the covariance
matrices of these classes cannot be expected to be equal, not even if they
were equal for the s single clusters.

A better linear projection technique for cluster validation is the
application of asymmetric linear dimension reduction (Hennig (2002)) to the two
classes obtained by the asymmetry principle. Asymmetry means that the two
classes to be projected are not treated equally. Asymmetric discriminant
coordinates maximize the separation between class 1 and class 2 while keeping
the projected variation of class 1 small. Class 2, i.e., the union of all other
data points, may appear as heterogeneous as necessary. Four asymmetric
projection methods are proposed in Hennig (2002), of which asymmetric
weighted coordinates (AWCs) are the most suitable for cluster validation.
The first AWC $c_1$ is defined by maximizing

$$F^*(c_1) = \frac{c_1' B^* c_1}{c_1' S_1 c_1}, \qquad B^* = \sum_{i,j} w_j (x_{1i} - x_{2j})(x_{1i} - x_{2j})',$$

where $S_1$ is the covariance matrix of class 1 and the weights $w_j$ are
determined by the Mahalanobis distance of $x_{2j}$ from class 1 and by a
constant d > 0, for example the 0.99-quantile of the $\chi^2_p$-distribution.
The second AWC $c_2$ maximizes $F^*$ subject to $c_1' S_1 c_2 = 0$ and
so on. $c_1' B^* c_1$ gets large if the projected differences between points
of class 1 and class 2 are large. The weights $w_j$ downweight differences from
points of class 2 that are very far away (in Mahalanobis distance) from class 1.
Otherwise, $c_1' B^* c_1$ would be governed mainly by such points, and class 1
would appear separated mainly from the furthest points in class 2, while
it might be mixed up more than necessary with closer points of class 2.
The weights result in a projection that separates class 1 also from the closest
points as well as possible. More motivation and background is given in Hennig
(2002). As for DCs, the computation of AWCs is easily done by an
Eigenvector decomposition of $S_1^{-1} B^*$. Note that AWCs can only be applied
if $n_1 > p$, because otherwise class 1 could be projected onto a single point,
thus $c_1' S_1 c_1 = 0$. If $n_1$ is not much larger than p, $c_1' S_1 c_1$ can
be very small, and some experience (e.g., with simulated data sets from
unstructured data) is necessary to judge if a seemingly strong separation is
really meaningful.

3 Optimal projection for heterogeneity

Unfortunately, AWCs cannot be used to assess the homogeneity of a cluster.
The reason is that along projection directions that do not carry any
information regarding the cluster, the cluster usually does not look separated,
but often more or less homogeneous. Thus, to assess separation, the projected
separation has to be maximized, which is done by AWCs. But to assess
homogeneity, it is advantageous to maximize the projected heterogeneity of the
cluster.

Projection pursuit is the generic term for linear projection methods that
aim at finding “interesting”, i.e., heterogeneous projections of the data
(Huber (1985)). The idea is to project only the points of the cluster to be
validated in order to find a most heterogeneous visualization. There are many
projection pursuit indexes. Some of them are implemented in the data
visualization software XGOBI (Buja et al. (1996)). A main problem with
projection pursuit is that the indexes can only be optimized locally. XGOBI
visualizes the optimization process dynamically, and after a local optimum has
been found, the data can be rotated toward new configurations to start another
optimization run.

Two very simple and useful indexes have been introduced by Cook et al. 
(1993) and are implemented in XGOBI. The first one is the so-called “holes 
index”, which is defined by minimizing 

$$ F^{**}(C) = \sum_{i=1}^{n_1} \varphi_2(C' x_{1i}), $$

over orthogonal $p \times 2$ projection matrices $C$, where $\varphi_2$ denotes the density of
the two-dimensional Normal distribution and the points $x_{1i}$ are assumed to
be centered and scaled. $F^{**}$ becomes minimal if as few points as possible are
in the center of the projection, in other words, if there is a “hole” . Often, such 
a projection shows a possible division of the cluster points into subgroups. 
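
For a fixed projection matrix the index is cheap to evaluate. The sketch below is not XGOBI's code: the name `holes_index` is hypothetical, the density is assumed to be the standard bivariate Normal, and the (local) search over projection matrices is not shown.

```python
import numpy as np

def holes_index(points, proj):
    """Sum of bivariate standard-Normal densities at the projected points.

    points: (n, p) cluster points, assumed already centered and scaled
    proj:   (p, 2) matrix with orthonormal columns
    Small values indicate a "hole" in the center of the projection,
    large values a concentration of points there.
    """
    z = points @ proj                          # (n, 2) projected points
    sq_norm = np.sum(z ** 2, axis=1)
    return float(np.sum(np.exp(-0.5 * sq_norm) / (2.0 * np.pi)))
```

Minimizing this quantity over the projection matrix gives the holes projection; maximizing the same function gives the central mass projection.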

It is also useful to maximize F**, which is called “central mass index” in 
XGOBI. This index attempts to project as many points as possible into the 
center, which can be used to find outliers in the cluster. But it can also be 
useful to try out further indexes, as discussed in Cook et al. (1993). 

4 Example 

As an example, two CA methods have been applied to the “quakes” data set, 
which is part of the base package of the free statistical software R (available
from www.R-project.org). The data consist of 1000 seismic events near Fiji,
for which five variables have been recorded, namely geographical longitude 
and latitude, depth, Richter magnitude and number of stations reporting. 
Because of the favorable relation of n to p, I divided the data set into 500 
points that have been used for clustering and 500 points for validation. 

The first clustering has been generated by MCLUST (Fraley and Raftery
(2003)), software for estimating a Normal mixture model including
noise, i.e., points that do not belong to any cluster. The Bayesian information
criterion has been used to decide on the number of clusters and the complexity
of the model for the clusters’ covariance matrices. This resulted in four
clusters with unrestricted covariance matrices plus noise. For comparison, I
have also performed a 5-means clustering on sphered data. 

Generally, the validity of the clusters of the MCLUST solution can be
confirmed. In Figure 1, the AWC plot is shown for the second cluster (points
of other clusters are always indicated with the cluster numbers). These points
appear separated neither in any scatterplot of two variables nor in the
principal components (not shown), but they are fairly well separated in the
AWC plot, and the projection of the validation points onto the AWCs (right
side) confirms that there is a meaningful pattern. Other clusters are even
better separated, e.g., cluster 3 on the left side of Figure 2. Some of the
clusters of the 5-means solution have a lower quality. For example, the AWC
plot of cluster 1 (right side of Figure 2) shows the separation as dominated
by the variation within this cluster.

Fig. 1. Left: AWCs of cluster 2 (black points) of the MCLUST solution. Right:
validation data set projected onto the AWCs.

Fig. 2. Left: AWCs of cluster 3 (black points) of the MCLUST solution. Right:
AWCs of cluster 1 (black points) of the 5-means solution.

Fig. 3. Left: “holes” projection of cluster 2 of the MCLUST solution. Right: “holes”
projection of cluster 3 of the MCLUST solution. 




Fig. 4. Left: “holes” projection of cluster 1 of the 5-means solution. Right: “central 
mass” projection of cluster 4 of the 5-means solution. 



Optimization of the holes index did not reveal any heterogeneity in 
MCLUST-cluster 2, see the left side of Figure 3, while in cluster 3 (right 
side) two subpopulations could roughly be recognized. Sometimes, when ap- 
plying MCLUST to other 500-point subsamples of the data, the correspond- 
ing pattern is indeed divided into two clusters (it must be noted that there is a 
non-negligible variation in the resulting clustering structures from MCLUST, 
including the estimated number of clusters, on different subsamples). Some 
of the 5-means clusters show a much clearer heterogeneity. The holes index 
reveals some subclasses of cluster 1 (left side of Figure 4), while the central
mass index highlights six outliers in cluster 4 (right side). 

5 Conclusion

A combination of two plots for visual cluster validation of every single cluster 
has been proposed. AWCs optimize the separation of the cluster from the 
rest of the data while the cluster is kept homogeneous. Projection pursuit is 
suggested to explore the heterogeneity of a cluster. 

Note that for large p compared to n, the variety of possible projections is 
large. Plots in which the cluster looks more or less separated or heterogeneous 
are found easily. Thus, it is advisable to compare the resulting plots with the 
corresponding plots from analogous cluster analyses applied to data with the 
same n and p generated from “null models” such as a normal or uniform 
distribution to assess if the cluster to be validated yields a stronger pattern. 
This may generally be useful to judge the validity of visual displays. 

The proposed plots are static. This has the advantage that they are repro- 
ducible (there may be a non-uniqueness problem with projection pursuit) and 
they are optimal with respect to the discussed criteria. However, a further dy- 
namical visual inspection of the data by, e.g., the grand tour as implemented 
in XGOBI (Buja et al. (1996)), can also be useful to assess the stability of 
separation and heterogeneity as revealed by the static plots. 

AWCs are implemented in the add-on package FPC for the statistical 
software package R, available under www.R-project.org. 


Abstract. Boosting algorithms combine moderately accurate classifiers in order
to produce highly accurate ones. The most important boosting algorithms are
Adaboost and Arc-x(j). While belonging to the same family of algorithms, they differ
in the way they combine classifiers: Adaboost uses a weighted majority vote, while
Arc-x(j) combines them through a simple majority vote. Breiman (1998) obtains the
best results for Arc-x(j) with j = 4, but higher values were not tested. Two other
values, j = 8 and j = 12, are tested and compared to the previous one and
to Adaboost. Based on several real binary databases, the empirical comparison shows
that Arc-x4 outperforms all other algorithms.



1 Introduction 

Boosting algorithms are one of the most recent developments in classification 
methodology. They repeatedly apply a classification algorithm as a subrou- 
tine and combine moderately accurate classifiers in order to produce highly 
accurate ones. The first boosting algorithm, developed by Schapire (1990),
converts a weak learning algorithm into a strong one. A strong learning al- 
gorithm achieves low error with high confidence while a weak learning algo- 
rithm drops the requirement of high accuracy. Freund (1995) presents another 
boosting algorithm, boost-by-majority, which outperforms the previous one. 

Freund and Schapire (1997) present another boosting algorithm, Ad- 
aboost. It is the first adaptive boosting algorithm because its strategy de- 
pends on the advantages of obtained classifiers, called hypotheses. For binary 
classification, the advantage of a hypothesis measures the difference between 
its performance and random guessing. The only requirement of Adaboost 
is to obtain hypotheses with positive advantage. Furthermore, the final hy- 
pothesis is a weighted majority vote of the generated hypotheses where the 
weight of each hypothesis depends on its performance. Due to its adaptive 
characteristic, Adaboost has received more attention than its predecessors. 
Experimental results (Freund and Schapire (1996), Bauer and Kohavi (1999)) 
show that Adaboost decreases the error of the final hypothesis. 

Breiman (1998) introduces the ARCING (Adaptively Resampling and Combining)
family of algorithms, to which Adaboost belongs. In order to better
understand the behavior of Adaboost, Breiman (1998) develops a simpler
boosting algorithm denoted by Arc-x(j). This algorithm uses a different
weight updating rule and combines hypotheses using a simple majority vote.
The best results of Arc-x(j) are obtained for j = 4. When compared to
Adaboost, Breiman’s results show that both algorithms perform equally well.
Breiman (1998) argues that the success of Adaboost is due not to its way of
combining hypotheses but to its adaptive property. He also argues that, since
higher values of j were not tested, further improvement is possible.

In this paper, two other values for the parameter j of the Arc-x(j) algorithm,
j = 8 and j = 12, are tested and their performance is compared to Adaboost
and Arc-x4 in the subsampling framework using a one-node decision tree
algorithm.

In section two, the different boosting algorithms used are briefly intro- 
duced. In section three, the empirical study is described and the results are 
presented. Finally, section four provides a conclusion to this article. 

2 Arcing algorithms 

Adaboost was the first adaptive boosting algorithm. First, the general framework
of boosting algorithms is introduced, then Adaboost and some of its
characteristics are reviewed. Finally, the arcing family of algorithms is discussed.

Consider a labeled training set $(x_1, y_1), \ldots, (x_n, y_n)$, where each $x_i$ belongs
to the instance space $X$ and each label $y_i$ to the label set $Y$. Here only
the binary case is considered, where $Y = \{-1, +1\}$. Adaboost repeatedly applies,
in a series of iterations $t = 1, \ldots, T$, the given learning algorithm to
a reweighted training set. It maintains a weight distribution over the training
set. Starting with equal weights assigned to all instances, $D(x_i) = 1/n$,
weights are updated after each iteration such that the weight of misclassified
instances is increased. Weights represent instance importance: increasing
an instance’s weight gives it more importance and thus forces the learning
algorithm to focus on it in the next iteration. In each iteration, the learning
algorithm outputs a hypothesis $h_t$ that predicts the label $h_t(x_i)$ of each instance.
For a given iteration, the learning algorithm tries to minimize the error:

$$ \epsilon_t = \Pr[h_t(x_i) \ne y_i], \qquad (1) $$

where $\Pr[\cdot]$ denotes the empirical probability on the training sample.



2.1 Adaboost 

Adaboost requires that the learning algorithm outputs hypotheses with error
less than 0.5. A parameter $\alpha_t$ is used to measure the importance assigned
to each hypothesis. This parameter depends on the hypothesis’ performance. For
the binary case this parameter is set to:

$$ \alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right). \qquad (2) $$

The weight distribution is updated using $\alpha_t$ (see Figure 2.1). This parameter
is positive because Adaboost requires that the learning algorithm output
hypotheses with error less than 0.5. At the end of the process, a final hypothesis
is obtained by combining all hypotheses from previous iterations using a
weighted majority vote. The parameter $\alpha_t$ represents the weight of the
hypothesis $h_t$ generated in iteration $t$. The pseudocode of Adaboost for binary
classification is presented in Figure 2.1.

Adaboost requires that the base learner performs better than random
guessing. The error can be written as follows:

$$ \epsilon_t = 1/2 - \gamma_t, \qquad (3) $$

where $\gamma_t$ is a positive parameter that represents the advantage of the
hypothesis over random guessing. The training error of the final hypothesis is
bounded by:

$$ \prod_t 2\sqrt{\epsilon_t (1 - \epsilon_t)}. \qquad (4) $$



Given: $(x_1, y_1), \ldots, (x_n, y_n)$ where $x_i \in X$, $y_i \in Y = \{-1; +1\}$

1- Initialize $D_1(i) = 1/n$

2- For $t = 1$ to $T$:

• Train the weak classifier using $D_t$ and get a hypothesis
$h_t : X \to \{-1; +1\}$

• Compute $\epsilon_t = \sum_{i: h_t(x_i) \ne y_i} D_t(i)$

• If $\epsilon_t \ge 0.5$ stop.

• Choose: $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$

• Update: $D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$,
where $Z_t$ is a normalization factor

3- Output the final hypothesis: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$



Figure 2.1: Adaboost algorithm
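
The steps of Figure 2.1 can be sketched in Python with NumPy. This is an illustrative sketch and not the code used in the experiments: all function names are hypothetical, the weak learner is assumed to be an exhaustively searched one-node decision tree (stump), and the weighted training set is used directly rather than resampled.

```python
import numpy as np

def fit_stump(x, y, weights):
    """Weighted one-node tree: search (feature, threshold, sign) with least error."""
    best = (np.inf, 0, 0.0, 1)
    for feat in range(x.shape[1]):
        for thr in np.unique(x[:, feat]):
            pred = np.where(x[:, feat] <= thr, 1, -1)
            for sign in (1, -1):
                err = np.sum(weights[sign * pred != y])
                if err < best[0]:
                    best = (err, feat, thr, sign)
    return best

def stump_predict(x, feat, thr, sign):
    return sign * np.where(x[:, feat] <= thr, 1, -1)

def adaboost(x, y, n_rounds):
    """Binary Adaboost; labels y must be in {-1, +1}."""
    n = len(y)
    dist = np.full(n, 1.0 / n)                  # D_1(i) = 1/n
    ensemble = []
    for _ in range(n_rounds):
        err, feat, thr, sign = fit_stump(x, y, dist)
        if err >= 0.5:                          # no advantage over random guessing
            break
        err = max(err, 1e-12)                   # guard against division by zero
        alpha = 0.5 * np.log((1.0 - err) / err)
        pred = stump_predict(x, feat, thr, sign)
        dist = dist * np.exp(-alpha * y * pred) # raise weights of misclassified points
        dist /= dist.sum()                      # division by Z_t
        ensemble.append((alpha, feat, thr, sign))
    return ensemble

def adaboost_predict(ensemble, x):
    """Weighted majority vote of the stored hypotheses."""
    score = sum(a * stump_predict(x, f, t, s) for a, f, t, s in ensemble)
    return np.where(score >= 0, 1, -1)
```

A tie in the final vote is resolved in favor of +1 here, which is an arbitrary choice of this sketch.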

This bound can be expressed in terms of the advantage sequence $\gamma_t$:

$$ \prod_t 2\sqrt{\epsilon_t (1 - \epsilon_t)} \le \exp\Bigl(-2 \sum_t \gamma_t^2\Bigr). \qquad (5) $$
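
Inequality (5) follows from (3) and (4) in two lines. The derivation below is a standard argument supplied for convenience and is not taken from the original text; it uses only the elementary bound $1 - x \le e^{-x}$.

```latex
\begin{align*}
2\sqrt{\epsilon_t (1 - \epsilon_t)}
  &= 2\sqrt{\left(\tfrac{1}{2} - \gamma_t\right)\left(\tfrac{1}{2} + \gamma_t\right)}
   = \sqrt{1 - 4\gamma_t^2} \\
  &\le \sqrt{\exp(-4\gamma_t^2)} = \exp(-2\gamma_t^2),
\end{align*}
```

Taking the product over $t$ turns the exponents into the sum $-2\sum_t \gamma_t^2$.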

Thus, if each hypothesis is slightly better than random guessing, that is
$\gamma_t \ge \gamma$ for some $\gamma > 0$, the training error will drop exponentially fast.

The bound on the generalization error, i.e., the error of the final hypothesis
over the whole instance space $X$, depends on the training error, the size of
the sample $n$, the Vapnik-Chervonenkis (VC, Vapnik 1998) dimension $d$ of
the weak hypothesis space and the number of boosting iterations $T$. The
generalization error is at most:

$$ \hat{\Pr}[H(x) \ne y] + \tilde{O}\left(\sqrt{\frac{Td}{n}}\right). \qquad (6) $$

This bound grows with the number of iterations $T$, which suggests that Adaboost
will overfit as $T$ becomes large, but experimental results (Freund and Schapire
(1996)) show that Adaboost continues to reduce the generalization error as $T$
becomes large.



2.2 Arcing family 

Breiman (1998) used the term ARCING to describe the family of algorithms
that Adaptively Resample data and Combine the resulting hypotheses. Adaboost
was the first example of an arcing algorithm.

In order to study the behavior of Adaboost, Breiman developed an ad-hoc
algorithm, Arc-x(j). This algorithm is similar to Adaboost but differs in the
following:

• it uses a simpler weight updating rule:

$$ D_{t+1}(i) = \frac{1 + m(i)^j}{\sum_{k=1}^{n} \left(1 + m(k)^j\right)}, \qquad (7) $$

where $m(i)$ is the number of misclassifications of instance $i$ by classifiers
$1, \ldots, t$ and $j$ is an integer.

• classifiers are combined using simple majority vote.
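
Both differences are easy to express in code. The following is a minimal sketch with hypothetical function names, assuming labels in {-1, +1} and that a tied vote is resolved as +1 (an arbitrary choice not specified in the text).

```python
import numpy as np

def arc_weights(miscount, j=4):
    """Resampling distribution of Arc-x(j), as in update rule (7).

    miscount[i] is m(i), the number of classifiers built so far
    that misclassified instance i.
    """
    raw = 1.0 + np.asarray(miscount, dtype=float) ** j
    return raw / raw.sum()

def majority_vote(predictions):
    """Unweighted vote; predictions has shape (T, n) with entries in {-1, +1}."""
    votes = np.sum(predictions, axis=0)
    return np.where(votes >= 0, 1, -1)
```

Larger values of j concentrate the distribution more strongly on the instances that have been misclassified most often, which is the effect examined for j = 8 and j = 12.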



Since the development of the arcing family, Adaboost and Arc-x4 have been compared
in different frameworks and using different collections of data sets. Breiman
(1998) and Bauer and Kohavi (1999) show that Arc-x4 has an accuracy comparable
to Adaboost without using the weighting scheme to construct the
final classifier. Breiman (1998) notes that higher values of j were not tested,
so improvement is possible.

In this empirical study, two other values of the parameter j, j = 8 and 
j = 12, are tested in the subsampling framework and compared to Adaboost 
and Arc-x4. 



3 Empirical study 

First, the base classifier and the performance measure used in the experiments
are introduced, then the experimental results of each algorithm are
presented. Finally, the performance of all algorithms is compared.

3.1 Base classifier and performance measure

Boosting algorithms require a base classifier as a subroutine that performs
slightly better than random guessing. In our experiments, we use a simple
algorithm, developed by Iba and Langley (1992), that induces a one-node
decision tree from a set of preclassified training instances.

In order to compare different boosting algorithms, we use a collection of
binary data sets from the UCI Machine Learning Repository (Blakes et al. (1998)).
Details of these data sets are presented in Table 2.

For each data set, we repeat the experiment 50 times. Each time, the data 
set is randomly partitioned into two equally sized sets. Each set is used once 
as a training set and once as a testing set. We run each algorithm for T = 
25 and 75 iterations and report the average test error. 

The performance measures of Bauer and Kohavi (1999) are used. For a fixed
number of iterations, the performance of each algorithm is evaluated using
the test error averaged over all data sets. To measure the improvement produced by
a boosting algorithm, the absolute test error reduction and the relative test error
reduction are used.

3.2 Results 

Results are reported in Table 1 and interpreted as follows: for a fixed number 
of iterations, we evaluate the performance of each algorithm on the collection 
of data sets and on each data set. Then all algorithms are compared for 25 
and 75 iterations using test error averaged over all data sets. 



Table 1. Average test error (in %) of each algorithm after 25 and 75 iterations on each
data set.

        Base         Adaboost         Arc-x4           Arc-x8           Arc-x12
Data    classifier   25      75       25      75       25      75       25      75
Liv.    41.81        29.78   29.35    29.96   28.94    31.94   29.24    34.28   29.60
Hea.    28.96        19.58   20.38    18.99   18.93    20.25   19.41    21.79   20.07
Ion.    18.93        12.38   11.32    12.27   11.83    12.22   11.21    12.59   11.01
Bre.     8.32         4.56    4.62     3.88    3.87     4.22    4.14     4.40    4.26
Tic.    34.66        28.80   28.68    29.59   28.43    31.38   28.82    30.11   29.36
mean    26.54        19.02   18.87    18.94   18.40    20.00   18.56    20.63   18.86



Adaboost results: Adaboost decreases the average test error by 7.52% for 
25 iterations and by 7.67% for 75 iterations. All data sets have relative test 
error reduction higher than 15%. The results for 75 iterations are better than 
those obtained for 25 iterations except for breast cancer data and heart data. 
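
The two reduction measures can be recomputed from the mean row of Table 1. The sketch below uses hypothetical function names and assumes that the relative reduction is measured against the base classifier's error, which is consistent with the figures quoted in this section.

```python
def absolute_reduction(base_error, boosted_error):
    """Difference between base and boosted test error (percentage points)."""
    return base_error - boosted_error

def relative_reduction(base_error, boosted_error):
    """Absolute reduction as a fraction of the base error (assumed definition)."""
    return (base_error - boosted_error) / base_error

# Mean test errors in percent, copied from the "mean" row of Table 1.
base_mean = 26.54
adaboost_mean = {25: 19.02, 75: 18.87}
for rounds, err in adaboost_mean.items():
    print(rounds,
          round(absolute_reduction(base_mean, err), 2),
          round(100 * relative_reduction(base_mean, err), 1))
```

The absolute reductions obtained this way are the 7.52 and 7.67 percentage points reported for Adaboost.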

Table 2. Data sets used in the experimental study

Data set                 Number of instances   Number of attributes
Liver disorders (Liv)    345                   7
Heart (Hea)              270                   13
Ionosphere (Ion)         351                   34
Breast cancer (Bre)      699                   10
Tic tac toe (Tic)        958                   9



Arc-x(j) results: All Arc-x(j) algorithms decrease the test error. The rel- 
ative test error reduction is higher than 15% for all datasets except when 
Arc-x(j) algorithms are applied for 25 iterations on the tic tac toe dataset. 
Results produced for 75 iterations are better than those obtained for 25 iter- 
ations. 



Comparing algorithms: When comparing the results of the different
boosting algorithms for 25 and 75 iterations, we notice that:

• For 25 iterations, the lowest average test error is produced by the Arc-x4
algorithm.

• The relative average error reduction between Arc-x4 and Adaboost is
0.43%, which is not significant.

• The average error of Arc-x4 is better than that of Arc-x8 by 5.62% and
than that of Arc-x12 by 8.96%; both differences are significant at the 5% level.

• Arc-x4 and Adaboost each produce the lowest error on 2 data sets; Arc-x8
outperforms the other algorithms on 1 data set.

• For 75 iterations, Arc-x4 outperforms all other algorithms.

• Adaboost and Arc-x12 perform equally well and less accurately than
Arc-x4 and Arc-x8.

• Arc-x4 produces the lowest error on 4 data sets and Arc-x12 on 1 data
set.

• The relative average error reduction between the lowest and the highest
error is 2.55%, which is not significant.

4 Conclusion 

This empirical study is an extension to Breiman’s (1998) study on the family 
of boosting algorithms, the ARCING family. Two extensions of arcing weight 
updating rules are tested and compared to the one used by Breiman (1998) 
and to Adaboost in the subsampling framework. 

Our empirical results show that increasing
the parameter j of the Arc-x(j) algorithm does not improve the performance of
arcing algorithms. The absolute test error reduction is higher for the first 25
iterations than for the last 50 iterations. It would be interesting to look for another
way of combining classifiers that gives more weight to the first ones and
thus produces a lower test error.


Abstract. This paper proposes a method of finding a discriminative linear
transformation that enhances the data’s degree of conformance to the compactness
hypothesis and its inverse. The problem formulation relies on inter-observation
distances only, which is shown to improve non-parametric and non-linear classifier
performance on benchmark and real-world data sets. The proposed approach is suitable
for both binary and multiple-category classification problems, and can be applied 
as a dimensionality reduction technique. In the latter case, the number of necessary 
discriminative dimensions can be determined exactly. The sought transformation is 
found as a solution to an optimization problem using iterative majorization. 



1 Introduction 

Efficient algorithms, developed originally in the field of multidimensional scal- 
ing (MDS), quickly gained popularity and paved their way into discriminant 
analysis. Koontz and Fukunaga (1972), as well as Cox and Ferry (1993) pro- 
posed to include class membership information in the MDS procedure and 
recover a discriminative transformation by fitting a posteriori a linear or 
quadratic model to the obtained reduced-dimensionality configuration. The 
wide-spread use of guaranteed-convergence optimization techniques in MDS 
sparked the development of more advanced discriminant analysis methods, 
such as one put forward by Webb (1995), that integrated the two stages of 
scaling and model fitting, and determined the sought transformation as a 
part of the MDS optimization. These methods, however, focused mostly on 
deriving the transformation without adapting it to the specific properties of 
the classifier that is subsequently applied to the observations in the trans- 
formed space. In addition to that, these techniques do not explicitly answer 
the question of how many dimensions are needed to distinguish among a 
given set of classes. 

In order to address these issues, we propose a method that relies on an 
efficient optimization technique developed in the field of MDS and focuses on 
finding a discriminative transformation based on the compactness hypothesis 
(see Arkadev and Braverman (1966)). The proposed method differs from the 
above work in that it specifically aims at improving the accuracy of the non- 
parametric type of classifiers, such as nearest neighbor (NN), Fix and Hodges 
(1951), and can determine exactly the number of necessary discriminative 
dimensions, since feature selection is embedded in the process of deriving the
sought transformation. 

The remainder of this paper is structured as follows. In Section 2, we 
formulate the task of deriving a discriminant transformation as a problem of 
minimizing a criterion based on the compactness hypothesis. Then, in Section 
3, we demonstrate how the method of iterative majorization (IM) can be used 
to find a solution that optimizes the chosen criterion. Section 4 describes 
the extensions of the proposed approach for dimensionality reduction and 
multiple class discriminant analysis, whereas the details of our experiments 
are provided in Section 5. 

2 Problem formulation 

Suppose that we seek to distinguish between two classes represented by matrices
$X$ and $Y$ having $N_X$ and $N_Y$ rows of $m$-dimensional observations,
respectively. For this purpose, we are looking for a transformation matrix
$T \in \mathbb{R}^{m \times k}$, $k \ll m$, that results in compactness among members of one
class, and separation between members of different classes.

While the above preamble may fit just about any class-separating trans- 
formation method profile (e.g., Duda and Hart (1973)), we must emphasize 
several important assertions that distinguish the presented method and nat- 
urally lead to the problem formulation that follows. First of all, we must re- 
iterate that our primary goal is to improve the NN performance on the task 
of discriminant analysis. Therefore, the sought problem formulation must re- 
late only to the factors that directly influence the decisions made by the NN 
classifier, namely the distances among observations. Secondly, in order to
benefit as much as possible from the non-parametric nature of the NN, the
sought formulation must not rely on the traditional class separability and
scatter measures that use class means, weighted centroids or their variants,
which, in general, imply quite strong distributional assumptions. Finally,
an asymmetric product form is preferable, as it is consistent
with the properties of the data encountered in the target application area of
multimedia retrieval and categorization, Zhou and Huang (2001). More for- 
mally, these requirements can be accommodated by an optimization criterion 
expressed in terms of distances among the observations from the two datasets 
as follows: 



$$ J(T) = \frac{\left(\prod_{i<j}^{N_X} \Psi\bigl(d_{ij}^W(T)\bigr)\right)^{\frac{2}{N_X(N_X-1)}}}{\left(\prod_{i=1}^{N_X}\prod_{j=1}^{N_Y} d_{ij}^B(T)\right)^{\frac{1}{N_X N_Y}}}, \qquad (1) $$

where the numerator and denominator of (1) represent the geometric means of
the within- and between-class distances defined as $d_{ij}^W(T) = \sqrt{(x_i - x_j) T T^T (x_i - x_j)^T}$
and $d_{ij}^B(T) = \sqrt{(x_i - y_j) T T^T (x_i - y_j)^T}$, respectively, and $\Psi(\cdot)$ denotes a Huber robust
estimation function, Huber (1964), parametrized by a positive constant $c$ and
defined as:



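$$ \Psi(d_{ij}^W) = \begin{cases} \frac{1}{2} (d_{ij}^W)^2 & \text{if } d_{ij}^W \le c; \\ c\, d_{ij}^W - \frac{1}{2} c^2 & \text{if } d_{ij}^W > c. \end{cases} \qquad (2) $$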



The choice of the Huber function in (1) is motivated by the fact that at $c$ the
function switches from a quadratic to a linear penalty, which mitigates the
consequences of an implicit unimodality assumption that the formulation of
the numerator of (1) may lead to. In the logarithmic form, criterion (1) is
written as:



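$$ \log J(T) = \frac{2}{N_X(N_X-1)} \sum_{i<j}^{N_X} \log \Psi\bigl(d_{ij}^W(T)\bigr) - \frac{1}{N_X N_Y} \sum_{i=1}^{N_X} \sum_{j=1}^{N_Y} \log d_{ij}^B(T) = \alpha S_W(T) - \beta S_B(T). \qquad (3) $$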



Our preliminary studies, Kosinov (2003), have shown that neither 
straightforward gradient descent nor some of the state-of-the-art optimization 
routines are suitable for solving the above optimization problem mostly due 
to susceptibility to local minima, adverse dependence on the initial value, and 
difficulties related to the discontinuities of the derivative of (3). However, by 
deriving some approximations of $S_W(T)$ and $S_B(T)$ one can make the task of
minimizing the $\log J(T)$ criterion amenable to a simple iterative procedure based
on the majorization method (Borg and Groenen (1997), de Leeuw (1977), 
Heiser (1995)), which we discuss in the following section. 
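
Before turning to the optimization itself, it may help to see how the logarithmic criterion is evaluated for a given transformation. The sketch below is illustrative only: the function names are hypothetical, the Huber function is assumed to have the standard form (quadratic up to $c$, linear beyond), and all observations are assumed to be distinct so that no logarithm of zero occurs.

```python
import numpy as np

def huber(d, c):
    """Huber function: quadratic for d <= c, linear for d > c."""
    d = np.asarray(d, dtype=float)
    return np.where(d <= c, 0.5 * d ** 2, c * d - 0.5 * c ** 2)

def pairwise_dist(a, b, t):
    """Euclidean distances between rows of a and rows of b after mapping by t."""
    diff = (a @ t)[:, None, :] - (b @ t)[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

def log_j(x, y, t, c):
    """Mean log Huber within-class distance of X minus the
    mean log between-class distance from X to Y."""
    n_x = x.shape[0]
    d_w = pairwise_dist(x, x, t)[np.triu_indices(n_x, k=1)]   # pairs with i < j
    d_b = pairwise_dist(x, y, t)
    return float(np.mean(np.log(huber(d_w, c))) - np.mean(np.log(d_b)))
```

Taking means over the within-class pairs and over all between-class pairs corresponds to the logarithms of the two geometric means in the criterion.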



3 Iterative majorization 

It can be verified that majorization remains valid under additive decom- 
position. Therefore, a possible strategy for majorizing (3) is to deal with 
$S_W(T)$ and $-S_B(T)$ separately and subsequently recombine their respective
majorizing expressions. We begin by noting that both the logarithm and Huber
function are majorizable by linear and quadratic functions, respectively,
Heiser (1995). This fact makes it possible to derive a majorizing function of
$S_W(T)$ as follows:

$$ S_W(T) = \sum_{i<j}^{N_X} \log \Psi\bigl(d_{ij}^W(T)\bigr) \le \sum_{i<j}^{N_X} \frac{w_{ij}\,\bigl(d_{ij}^W(T)\bigr)^2}{2\,\Psi\bigl(d_{ij}^W(\bar T)\bigr)} + K_1 = \mu_{S_W}(T, \bar T), \qquad (4) $$

where $T, \bar T \in \mathbb{R}^{m \times m}$, $\bar T$ is a supporting point for $T$, $w_{ij}$ is a weight of the
Huber function majorizer, that in this case is equal to 1 if $d_{ij}^W(\bar T) < c$
or $c / d_{ij}^W(\bar T)$ otherwise, and $K_1$ is a constant term with respect to $T$.
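
The two majorization facts used here can be checked numerically. The sketch below is not from the original text: the function names are hypothetical and the Huber function is assumed to have its standard form. It builds the quadratic majorizer of the Huber function at a supporting point and combines it with the tangent-line majorizer of the logarithm.

```python
import numpy as np

def huber(d, c):
    d = np.asarray(d, dtype=float)
    return np.where(d <= c, 0.5 * d ** 2, c * d - 0.5 * c ** 2)

def huber_majorizer(d, d_bar, c):
    """Quadratic in d that lies above huber(., c) and touches it at d_bar."""
    weight = 1.0 if d_bar <= c else c / d_bar
    const = huber(d_bar, c) - 0.5 * weight * d_bar ** 2
    return 0.5 * weight * np.asarray(d, dtype=float) ** 2 + const

def log_huber_majorizer(d, d_bar, c):
    """Uses log(u) <= log(u_bar) + (u - u_bar) / u_bar on top of the
    quadratic majorizer, giving a function that is quadratic in d."""
    u_bar = huber(d_bar, c)
    return np.log(u_bar) + (huber_majorizer(d, d_bar, c) - u_bar) / u_bar
```

The coefficient of the squared distance in `log_huber_majorizer` is the weight divided by twice the Huber value at the supporting point, which is the form of the summands in the majorizing function of the within-class term.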
Switching to matrix notation and defining a square symmetric design matrix
$B$ dependent on $\bar T$:

$$ b_{ij} = \begin{cases} -\dfrac{w_{ij}}{\Psi\bigl(d_{ij}^W(\bar T)\bigr)} & \text{if } i \ne j, \\ -\sum_{k=1, k \ne i}^{N_X} b_{ik} & \text{if } i = j, \end{cases} \qquad (5) $$

leads to the majorizing expression of $S_W(T)$ in its final form:

$$ \mu_{S_W}(T, \bar T) = \tfrac{1}{2} \mathrm{tr}(T^T X^T B X T) + K_1. \qquad (6) $$

An attempt to majorize $-S_B(T)$ directly runs into problems due to the
difficulties of finding a proper quadratic majorizing function of the negative
logarithm. As a practical solution, we replace the neg-logarithm with its
piece-wise linear approximation (see Figure 1, left panel), which, in turn, can
be represented as a sum of the functions defined as:

$$ g(x; x_0, l, r) = \begin{cases} r(x - x_0) & \text{if } x > x_0, \\ -l(x - x_0) & \text{if } x \le x_0; \end{cases} \qquad (7) $$

where $l + r > 0$, to ensure convexity. It is easy to see that the family of
functions defined in (7) is one of the many possible generalizations of the
absolute value function $|x|$, the former being equivalent to the latter whenever
$x_0 = 0$ and $l = r = 1$. Similarly to $|x|$, $g(x; x_0, l, r)$ can be majorized by a
quadratic $ax^2 + bx + c$ with coefficients $a > 0$, $b$ and $c$ determined from the
majorization requirements (see an example in Figure 1, right panel). Finally,
$-S_B(T)$ expressed in terms of the above quadratics can be majorized by the
following function, written in matrix notation as:

$$ \mu_{-S_B}(T, \bar T) = \mathrm{tr}(T^T Z^T G Z T) - \mathrm{tr}(T^T Z^T C Z \bar T) + K_2, \qquad (8) $$

where $Z$ is the matrix obtained by joining $X$ and $Y$ together, row-wise, and
$G$, $C$ are design matrices dependent on $\bar T$, whose non-zero elements $m_{ij}$ are:

$$ m_{ij} = \begin{cases} p_{ij} & \text{for } i \in [1; N_X] \text{ and } j \in [N_X + 1; N], \\ p_{ij} & \text{for } i \in [N_X + 1; N] \text{ and } j \in [1; N_X], \\ -\sum_{k=1, k \ne i}^{N_X + N_Y} m_{ik} & \text{for } i = j, \end{cases} \qquad (9) $$

where $N = N_X + N_Y$ and $p_{ij}$ is equal to $-1$ and $-1/\bigl(d_{ij}^B(\bar T)\bigr)^2$ for $C$ and $G$, respectively (see
Kosinov (2003) for derivation details and a description of an alternative faster
method based on Taylor series expansion).

Fig. 1. Majorization of the piecewise-linear approximation of $-\log(x)$

Finally, combining results (6) and (8), we obtain a majorizing function of 
the log J(T) optimization criterion: 

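$$ \mu_{\log J}(T, \bar T) = \alpha\,\mu_{S_W} + \beta\,\mu_{-S_B} = \frac{\alpha}{2}\,\mathrm{tr}(T^T X^T B X T) + \beta\,\mathrm{tr}(T^T Z^T G Z T) - \beta\,\mathrm{tr}(T^T Z^T C Z \bar T) + K_3, \qquad (10) $$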

that is used to find an optimal transformation T minimizing log J(T) criterion 
via the iterative procedure described in Heiser (1995), and, thus, constitutes 
the core of the proposed distance-based discriminant analysis (DDA) method. 

While at every iteration it is possible to minimize (10) by solving a system 
of linear equations, it is often recommended, Krogh and Hertz (1992), that 
a length-constrained solution be found, especially in the case of classifiers 
capable of achieving zero training error, to prevent overfitting. By incorpo- 
rating the constraint into the Lagrangian, we obtain a standard trust-region 
subproblem, for which efficient solution methods exist, Rojas et al. (2000), 
Hager (2001). 

4 Dimensionality reduction and multiple-class setting 

For any $T \in \mathbb{R}^{m \times k}$, $k < m$, the proposed method has an additional advantage
of being a dimensionality reduction technique. Moreover, the value of
$k$, i.e., the exact number of dimensions the data can be reduced to without
loss of discriminatory power with respect to (3), is precisely determined
by the number of non-zero singular values of $T$. Indeed, the distances between
the transformed observations may be viewed as distances between the
original observations in a different metric $TT^T$, that can be expressed as
$TT^T = USV^TVSU^T = U_k S_k^2 U_k^T$ using the singular value decomposition of
$T$. The obtained expression reveals that the effect of the full-dimensional
transformation $T$ is captured by the first $k$ left-singular vectors of $T$ scaled
by the corresponding non-zero singular values, whose number gives an answer
to the question of how many dimensions are needed in the transformed space.
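
The reduction step can be sketched with NumPy's singular value decomposition. The function name `reduce_transformation` and the relative tolerance used to decide which singular values count as zero are assumptions of this sketch, not part of the method's description.

```python
import numpy as np

def reduce_transformation(t_full, tol=1e-10):
    """Return (k, T_k): the number of non-negligible singular values of
    t_full and the m x k matrix U_k S_k that induces the same distances."""
    u, s, _ = np.linalg.svd(t_full)
    k = int(np.sum(s > tol * s[0]))       # singular values are sorted, s[0] is the largest
    return k, u[:, :k] * s[:k]            # scale each kept left-singular vector
```

Because the discarded singular values are zero, distances between observations mapped by the reduced matrix equal those obtained with the full-dimensional transformation.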

While the above discussion is concentrated mostly on the two-class configuration,
it is straightforward to generalize the presented formulation to a
multiple-class discriminant analysis setting, for the number of classes $K > 2$:

$$ \log J_K(T) = \sum_{i=1}^{K-1} \left( \alpha^{(i)} S_W^{(i)}(T) - \beta^{(i)} S_B^{(i)}(T) \right). \qquad (11) $$

5 Experimental results 

Our empirical analysis was based on data sets from the UCI Machine Learning
Repository, Blake and Merz (1998). First of all, we verified that the solutions
of the optimization problem formulated in Section 2 found by the proposed
method were of better quality compared to those of generic techniques, confirming
the results reported by Van Deun and Groenen (2003), and Webb
(1995). Indeed, numerous random initializations of the gradient search led to
inferior as well as unstable results reflected in higher values of log J (see Figure 2),
while the IM-based method proved nearly insensitive to the choice of
the initial supporting point and regularly reached far lower criterion values,
maintaining the convergence property at all times. We also validated the proposed
dimensionality reduction technique by analysing how the classification
performance varied with respect to k, the dimensionality of the transformed
space, and how it was related to the number of non-zero singular values of the
full-dimensional transformation, an example of which for the Sonar data set is
depicted in Figure 3. The right pane plots the 10 largest out of 60 singular values
of the full-dimensional transformation, in descending order, while the left diagram
shows the results of 10-fold cross-validation experiments with respect
to the transformed space dimensionality. Dot-filled bars denote performance
achieved by fixing k a priori, while shaded bars show results obtained from
a k-truncated SVD of the full-dimensional transformation. It is easy to see
that the singular values beyond the 7th are virtually zero. And as the diagram
on the left confirms, adding dimensions beyond 7 no longer improves
the classification performance (confirmed by Chow test at 99% confidence).

Fig. 2. Two-dimensional discriminative projections of the Sonar data set: inferior
solutions found by the gradient descent method, with (a) log J = -0.17, (b) log J = -0.22,
(c) log J = -0.19.

Fig. 3. Dimensionality reduction experiments: classification performance results
(left) and singular values of $T \in \mathbb{R}^{m \times m}$ (right). The dashed lines mark the boundary
that determines the dimensionality of the transformed space.

The experiments with the rest of the UCI data sets compared 10-fold 
cross-validation classification performance of the nearest neighbor classifier 
in the original feature space (denoted as NN) and that achieved in the trans- 
formed space derived by the proposed distance-based discriminant analysis 
method (denoted henceforth as DDA+NN). Hence, the goal of this analysis 
was to assess the effect of applying a DDA transformation on the accuracy 
of the NN classifier. The error rates of NN and DDA+NN data classifica- 
tion experiments are presented in Table 1, showing a consistent improvement 



Table 1. Classification results for UCI data sets

Data set      Classes   % Error of NN   % Error of DDA+NN
Hepatitis     2         29.57            0.00
Ionosphere    2         13.56            7.14
Diabetes      2         30.39           27.11
Heart         2         40.74           21.11
Monk's P1     2         14.58            0.69
Balance       3         21.45            3.06
Iris          3          4.00            3.33
DNA           3         23.86            6.07
Vehicle       4         35.58           24.70

in performance. A separate set of experiments (see Kosinov (2003) for de- 
tails) using the ETH80 database also revealed the importance of the length 
constraint, introduced in Section 3 to avoid overfitting. The results of these 
tests demonstrated up to 20% better classification accuracy for the length- 
constrained version of the method. Additionally, the results of our more recent 
experiments reveal that the DDA combined with an SVM classifier, Cristian- 
ini and Shawe-Taylor (2000), produces a smaller number of support vectors 
leading to better classification accuracy. 
Abstract. The CHAID algorithm has proven to be an effective approach for obtaining
a quick but meaningful segmentation where segments are defined in terms
of demographic or other variables that are predictive of a single categorical crite- 
rion (dependent) variable. However, response data may contain ratings or purchase 
history on several products, or, in discrete choice experiments, preferences among 
alternatives in each of several choice sets. We propose an efficient hybrid method- 
ology combining features of CHAID and latent class modeling (LCM) to build a 
classification tree that is predictive of multiple criteria. The resulting method pro- 
vides an alternative to the standard method of profiling latent classes in LCM 
through the inclusion of (active) covariates. 



1 Background and summary of approach 

The CHAID (Chi-Squared Automatic Interaction Detection) tree-based
segmentation technique has been found to be an effective approach for obtaining
meaningful segments that are predictive of a K-category (nominal or ordinal)
criterion variable. For example, the dependent variable might be response to 
a mailing (responders vs. non-responders). Each of the resulting segments, 
depicted as a terminal node in a tree diagram, is defined as a combination of 
directly observable categorical predictors such as AGE = 18-24 & INCOME 
= $80,000+. Descriptive entries in each tree node consist of the sample size 
and the corresponding observed distribution on the dependent variable (e.g., 
associated response rate). 

Latent class (LC) models are useful in identifying segments that underlie 
multiple response variables. While the resulting latent classes can be either 
ordered (ordinal latent variable) or unordered (nominal latent variable), they 
are not actionable like CHAID segments, because by definition they are un- 
observable (latent). 

In this paper we propose a hybrid methodology that combines strengths
of both approaches. After decomposing a set of M response variables into K
underlying latent class segments, a modified CHAID algorithm is used with
the K latent classes serving as the K-category nominal (ordinal) criterion
variable. The resulting CHAID segments, derived from selected demographic
or other exogenous variables that are predictive of the classes, should also
tend to be predictive of the M criterion variables.

The hybrid method also provides an alternative to the use of covariates 
in LCM to profile the classes. In practice, one or more demographic or other 
exogenous variables are included in an LCM to describe/predict the latent 
classes using a multinomial logit model. The proposed CHAID-based
alternative is especially advantageous when the number of covariates is large,
when covariate effects are nonlinear, or when there are complicated
higher-order interactions.

In the next section we provide brief introductions to the standard CHAID 
algorithm and the standard LC (cluster and factor) models. We then provide 
the technical details of the hybrid approach, followed by an empirical example 
from a pre-post survey (Burns et al. (2001)). We conclude with some final 
remarks. 



2 The CHAID algorithm 

The original CHAID algorithm was introduced by Kass (1980) for nomi- 
nal dependent variables. CHAID is a recursive partitioning method useful 
in exploratory analyses that relate a potentially large number of categorical 
predictor variables to a single categorical nominal dependent variable. It was 
extended to ordinal dependent variables by Magidson (1993) who illustrated 
how this extension could be used to take advantage of fixed scores such as 
profitability, for each category of the dependent variable when such scores are 
known, as well as how to estimate meaningful scores when category scores 
are unknown. Chi-squared goodness of fit tests are used to identify signifi- 
cant predictors, and to merge predictor categories that do not differ in their 
prediction of the dependent variable. 

Predictor categories are eligible to be merged according to specified scale 
types. Any categories of Nominal (“free”) predictors can be merged, while 
only adjacent categories of ordinal or grouped continuous (“monotonic”) pre- 
dictors are allowed to merge. A final scale type (“float”) may be used to 
specify that the variable is to be treated as monotonic except for the final 
category, often corresponding to a ‘don’t know’ or ‘missing’ response, which 
is free to merge with any of the other categories. Technical settings include 
significance levels associated with merging and splitting and a stopping rule. 
A case weight and a frequency variable may also be included in the analysis. 
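The merging rules for the three scale types can be sketched as follows. This is only an illustration of which pairs of original categories are eligible to merge at the outset, not the CHAID implementation itself; the function name `eligible_merges` and the convention that the floating category is listed last are assumptions.

```python
from itertools import combinations

def eligible_merges(categories, scale_type):
    """List the category pairs that may be merged under a CHAID scale type.

    categories: labels in their natural order; for "float" the last label is
    taken to be the floating (e.g. missing or don't-know) category.
    """
    if scale_type == "free":
        # Any two categories of a nominal predictor may merge.
        return list(combinations(categories, 2))
    if scale_type == "monotonic":
        # Only adjacent categories may merge.
        return list(zip(categories[:-1], categories[1:]))
    if scale_type == "float":
        ordered, floating = categories[:-1], categories[-1]
        adjacent = list(zip(ordered[:-1], ordered[1:]))
        # The floating category may join any of the ordered categories.
        return adjacent + [(c, floating) for c in ordered]
    raise ValueError(f"unknown scale type: {scale_type}")
```

For a predictor coded 1, 2, 3 plus a missing code '.', the "float" rule therefore allows (1, 2), (2, 3) and each of (1, '.'), (2, '.'), (3, '.'), but not (1, 3).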

As an example, Figure 1 illustrates a CHAID analysis based on data from
a post-election survey of 1,051 persons who voted for either Bush or Gore
in the 2000 U.S. election. The dependent variable (VOTE) is the candidate
voted for and the predictors are 5 demographic variables: MARSTAT
(1 = married, 2 = widowed, 3 = separated/divorced, 4 = never married,
5 = other - “Free”), AGEr (1 = 18-24, 2 = 25-34, 3 = 35-44, 4 = 45-54,
5 = 55-64, 6 = 65+, ‘.’ = refused - “Float”), GENDER (1 = male, 2 = female),
EDUCATION (1 = less than HS, 2 = HS grad, 3 = some college, 4 = college grad,
5 = post grad, 5-refused - “Float”), and EMPLOYED (1 = Yes, 2 = No,
3 = retired - “Free”).

Fig. 1. CHAID tree for VOTE.

Overall, 48.2% voted for Bush. This is displayed in the top (root node) of
the tree. Among the 5 demographic predictors included in this analysis, only
2 were significant at the root node - MARSTAT (p < .00001), and GENDER
(p < .01). The CHAID analysis resulted in 4 segments. The best segments
for Bush are S1, consisting of the 673 married voters (53.94% for Bush)
and S2 consisting of 100 unmarried employed males (53.67% for Bush). The
remaining segments - S3 (unmarried unemployed males) and S4 (unmarried
females) - voted more than 2:1 in favor of Gore over Bush.

One limitation of CHAID is that segments are defined based on a single 
criterion variable. Given situations where multiple criteria exist, it is not clear 
how one should go about obtaining a single common segmentation. Using one 
dependent variable as the criterion may result in one set of segments, while 
use of an alternative dependent variable will likely yield a different set of 
segments. Moreover, the categories of a predictor may merge in different ways 
depending upon which dependent variable is used, again leading to different 
segments. 

In addition, when multiple dependent variables do exist, they may be 
of different scale types (nominal, ordinal, continuous, count, etc.). Using a 
3-category response variable as an example, Magidson (1993) showed that
CHAID segments resulting from treating the dependent variable as ordinal
(using profitability scores for the categories) differed substantially from
segments derived from the nominal algorithm which ignored the scores. The
hybrid approach resolves the need to choose between different segmentations
because indicators with differing scale types can be used in extended LCMs,
yielding a single LC solution. An important advantage of this hybrid approach
over approaches based on specific measures for node homogeneity rather than 
a model (e.g., Kim and Lee (2003)) is that the LC model used here can handle 
dependent variables of different scale types. 
3 Latent class modeling



A LC model postulates a nominal $K$-category latent (unobservable) variable
$X$ to explain the associations/correlations between the observed response
variables (multiple criteria; Lazarsfeld and Henry (1968); Goodman (1974)).
Each category of $X$ is called a latent class. Let $Y_m$ denote one of $M$
nominal response variables, $m = 1, 2, \ldots, M$; $j_m$ is a particular
response category and $J_m$ the number of categories of variable $Y_m$.
Notation $\mathbf{Y}$ and $\mathbf{j}$ is used to refer to a full response
vector and a full set of response categories. The LC model for $M$ response
variables is defined as

$$P(\mathbf{Y} = \mathbf{j}) = \sum_{k=1}^{K} P(X = k, \mathbf{Y} = \mathbf{j}) = \sum_{k=1}^{K} P(X = k)\, P(\mathbf{Y} = \mathbf{j} \mid X = k) = \sum_{k=1}^{K} P(X = k) \prod_{m=1}^{M} P(Y_m = j_m \mid X = k), \tag{1}$$

where $P(X = k)$ denotes the probability of being in latent class $k$,
$k = 1, 2, \ldots, K$, and $P(Y_m = j_m \mid X = k)$ denotes the conditional
probability of obtaining the $j_m$th response to item $Y_m$, from members of
class $k$, $j_m = 1, 2, \ldots, J_m$.

Cases with response pattern $\mathbf{j}$ are typically classified into the
latent class for which the posterior membership probability
$P(X = k \mid \mathbf{Y} = \mathbf{j})$ is highest. Estimates for the posterior
membership probabilities - for $k = 1, 2, \ldots, K$ - can be obtained using
Bayes theorem as follows:

$$P(X = k \mid \mathbf{Y} = \mathbf{j}) = \frac{P(X = k, \mathbf{Y} = \mathbf{j})}{P(\mathbf{Y} = \mathbf{j})}. \tag{2}$$



The numerator and denominator were defined in equation (1). 
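Equations (1) and (2) can be traced numerically with a small sketch. The function names `lc_joint` and `lc_posterior`, the zero-based category coding, and the two-class, two-item parameter values are illustrative assumptions, not estimates from the paper.

```python
import numpy as np

def lc_joint(class_probs, cond_probs, pattern):
    """P(X = k, Y = j) for every class k, the summand of equation (1).

    class_probs: length-K sequence of P(X = k).
    cond_probs:  list of M arrays; the m-th has shape (K, J_m) and holds
                 P(Y_m = j_m | X = k).
    pattern:     M zero-based response categories (j_1, ..., j_M).
    """
    joint = np.asarray(class_probs, dtype=float).copy()
    for probs_m, j_m in zip(cond_probs, pattern):
        joint *= np.asarray(probs_m, dtype=float)[:, j_m]
    return joint

def lc_posterior(class_probs, cond_probs, pattern):
    """P(X = k | Y = j) via Bayes theorem, equation (2)."""
    joint = lc_joint(class_probs, cond_probs, pattern)
    return joint / joint.sum()      # the denominator is P(Y = j) from (1)

# Hypothetical model: two classes, two binary items.
class_probs = [0.6, 0.4]
cond_probs = [np.array([[0.9, 0.1], [0.2, 0.8]]),
              np.array([[0.7, 0.3], [0.4, 0.6]])]
posterior = lc_posterior(class_probs, cond_probs, pattern=(0, 0))
modal_class = int(np.argmax(posterior))   # class used for modal assignment
```

With these made-up parameters the pattern (0, 0) is assigned to the first class, since its joint probability 0.378 dominates 0.032.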

Recent advances allow dependent variables (indicators) of varying scale
types to be used, including mixtures of categorical, continuous, and count
variables, by specifying the appropriate probability densities
$P(Y_m = j_m \mid X = k)$ (Vermunt and Magidson (2002)). By expressing the mean
of these densities in terms of a generalized linear model (GLM), one can
include direct effects between 2 or more indicators, multiple categorical
latent variables, continuous latent factors and/or other terms into the model
(see Magidson and Vermunt (2001); Vermunt and Magidson (2005)).

It is also possible to include one or more exogenous variables called covari- 
ates in a LCM, allowing one to explore the relationship between exogenous 
variables and the latent classes and assess the significance of such relation- 
ships in a formal way. However, the covariates included in LCM influence 
the estimates of the parameters in the original measurement model. If the 
covariate part of the model holds true, inclusion of the covariates improves 
the efficiency of the estimates. However, if it is misspecified, the estimates 
may become somewhat biased. In addition, profiling latent classes in terms
of many covariates may cause the solution to become unstable. As an alter- 
native, Magidson and Vermunt (2001) allow covariates to be treated in an 
inactive manner - providing appropriate cross-tabulations but not influenc- 
ing the original measurement model. But this approach comes at the expense 
of no longer being able to assess statistical significance. 

In the next section, we show how the hybrid algorithm provides an al- 
ternative treatment to the use of both active and inactive covariates in LC 
models. The new approach provides an assessment of statistical significance 
for selected covariates included within the LCM framework, whether the co- 
variate is specified as active or inactive. Those covariates specified as inactive 
do not alter the estimates obtained from the LCM. 

4 The hybrid CHAID algorithm 

Our hybrid CHAID algorithm involves 3 steps. 

1. Perform an LC cluster analysis on M response variables to obtain K 
latent classes. 

2. Perform a CHAID analysis using the K classes as a nominal dependent 
variable. 

3. Obtain predictions for each of M response variables based on the resulting 
CHAID segments and/or on any preliminary set of CHAID segments. 

Step 1 yields class-specific predicted probabilities for each category of the 
m-th dependent variable 1 , as well as posterior membership probabilities for 
each case. 

Step 2 yields a set of CHAID segments that differ with respect to their
average posterior membership probabilities for each class. We use the
posterior membership probabilities defined in equation (2) as fixed case
weights as opposed to the modal assignment into one of the K classes. This
weighting eliminates the bias due to the misclassification error that would
occur if cases were assigned (with probability one) to the class having the
highest posterior probability. Specifically, each case contributes K records
to the data, the kth record of which contains the value k for the dependent
variable, and contains a case weight of P(X = k | Y = j), the posterior
membership probability associated with that case. Thus, as opposed to the
original algorithm where chi-square is calculated on observed 2-way tables, in
the hybrid algorithm, the chi-squared statistic is computed on 2-way tables of
weighted cell counts. 2
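The weighting scheme of step 2 can be sketched as below. Accumulating each case's posterior vector into the row of its predictor category is equivalent to expanding the case into K weighted records. The function names, the integer coding of the predictor, and the use of the Pearson form of the chi-squared statistic are assumptions; the text does not specify which chi-squared statistic the program uses, and sampling weights are ignored here.

```python
import numpy as np

def weighted_table(predictor, posteriors, n_categories):
    """Predictor-by-class table of weighted cell counts.

    predictor:  length-N integer codes in 0..n_categories-1.
    posteriors: (N, K) array; row n holds P(X = k | Y = j) for case n.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    table = np.zeros((n_categories, posteriors.shape[1]))
    # Each case adds its K posterior weights to the row of its category.
    np.add.at(table, np.asarray(predictor), posteriors)
    return table

def pearson_chi2(table):
    """Pearson chi-squared statistic; assumes no empty row or column."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())

# Hypothetical data: 4 cases, a binary predictor, K = 2 latent classes.
predictor = [0, 0, 1, 1]
posteriors = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
table = weighted_table(predictor, posteriors, n_categories=2)
statistic = pearson_chi2(table)
```

Because every posterior vector sums to one, the weighted table keeps the original sample size as its grand total.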

If as an alternative to performing a standard LC analysis, one performs 
an LC factor analysis in step 1, in step 2 the CHAID ordinal algorithm can 

1 When one or more of the dependent variables are quantitative, for each class this 
step also yields predicted means for the quantitative dependent variables. 

2 The new algorithm also incorporates sampling weights, if present, using an effi- 
cient ML algorithm proposed by Vermunt and Magidson (2001). 
Fig. 2. Hybrid CHAID tree for 11 dependent variables.



be used to obtain segments based on the use of any of the LC factors as the 
ordinal dependent variable, or a single segmentation can be obtained using 
the nominal algorithm to identify segments based on the single joint latent 
variable defined as a combination of two or more identified LC factors. 

Step 3 involves obtaining predictions for any or all of the M dependent 
variables for each of the I CHAID segments by cross-tabulating the resulting 
CHAID segments by the desired dependent variable(s). An alternative is to 
obtain predictions as follows 



$$P(Y_m = j \mid i) = \sum_{k=1}^{K} P(Y_m = j \mid X = k)\, P(X = k \mid i).$$

As can be seen, we compute a weighted average of the class-specific
distributions for dependent variable $Y_m$ obtained in step 1
[$P(Y_m = j \mid X = k)$], with the average posterior membership probabilities
obtained in step 2 for segment $i$ being used as the weights
[$P(X = k \mid i)$].
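Written for all segments and categories at once, the weighted average is a matrix product. The sketch below assumes the class-specific distributions and the segment-level average posteriors are available as arrays; the function name and the numbers are illustrative only.

```python
import numpy as np

def segment_predictions(cond_probs_m, segment_posteriors):
    """P(Y_m = j | i) for every segment i and category j.

    cond_probs_m:       (K, J) array of P(Y_m = j | X = k) from step 1.
    segment_posteriors: (I, K) array of average P(X = k | i) from step 2.
    """
    return np.asarray(segment_posteriors, dtype=float) @ np.asarray(cond_probs_m, dtype=float)

# Hypothetical values: K = 2 classes, J = 2 categories, I = 2 segments.
cond_probs_m = [[0.9, 0.1], [0.2, 0.8]]
segment_posteriors = [[0.5, 0.5], [1.0, 0.0]]
predictions = segment_predictions(cond_probs_m, segment_posteriors)
```

Each row of `predictions` is again a probability distribution, because it is a convex combination of the class-specific distributions.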



5 Empirical example 

Among other questions, the pre-election survey solicited ratings for each can- 
didate on 5 attributes - leadership, caring, knowledge, honesty and morality. 
A LCM was fit to these data, using VOTE as an active covariate, and the 
5 demographics as inactive covariates. This model may be viewed as a kind 
of unsupervised regression with 11 dependent variables - VOTE, plus the 
10 attribute ratings. This LCM yielded 3 segments. The first segment (32%) 
favored Gore, the second (39%) was neutral and the third favored Bush with 
respect to the attribute ratings and in their votes. These percentages are 
displayed in the root node of the hybrid CHAID tree in Figure 2. 

The hybrid CHAID used the 3-category latent variable (segments) as the 
dependent variable and again utilized the 5 demographics as the predictors. 
Fig. 3. Hybrid CHAID tree for VOTE.



At the root node 3 of the 5 predictors were found to be significant -
MARSTAT (p < .00002), AGEr (p < .001), and GENDER (p = .01). Compared to
our earlier CHAID, age is more important than when VOTE was the only
dependent variable. The hybrid CHAID analysis resulted in 6 segments (Figure
2). Since the attributes are now included as additional dependent variables 
(the latent classes are a proxy for these dependent variables) we might expect 
that the resulting segments might predict any single dependent variable less 
well than CHAID based on only that dependent variable. 

Figure 3 shows how the 6 hybrid segments predict VOTE. To compare this 
to the predictions based on our original segments (Figure 1) we first compare 
those segments favorable to Bush. Our previous analysis identified segments 
S1 and S2 as favorable to Bush. The hybrid CHAID (Figure 3) identifies 3
segments most likely to vote for Bush - segments 1, 2 and 3. Note that these
3 segments combined are equivalent to the original segment S1. Since the
hybrid CHAID fails to yield any additional segments that prefer Bush such 
as S2, it appears that the hybrid segmentation predicts VOTE less well than 
the original CHAID. Similarly, focusing on segments most favorable to Gore, 
our previous CHAID identified S3 and S4 (n= 277 cases) as favoring Gore 
by more than 2:1. The hybrid CHAID finds segments 4, 5 and 6 as favoring 
Gore, but not by as much as 2:1 over Bush. 

6 Final comments 

In this paper, we introduced a hybrid CHAID algorithm 3 as an extension 
of CHAID to multiple dependent variables of possibly differing scale types. 
Alternatively, this hybrid algorithm could be described as an alternative to 
the standard treatment of active and/or inactive covariates in LCM. The 

3 The extended CHAID algorithm has been implemented in a commercially avail- 
able computer program called SI-CHAID 4.0, and works in conjunction with the 
latent class programs Latent GOLD 4.0 and Latent GOLD Choice 4.0. 
CHAID-type output can simplify the process of examining the relationship
between the demographics and/or other exogenous variables and the latent 
segments by 1) ranking the covariates from most to least significant and 2) 
for each covariate, merging categories that are not significantly different. This 
new output is especially valuable when the number of covariates is large. 

We illustrated the hybrid algorithm here with dependent variables con- 
sisting of favorability ratings of Bush and Gore on 5 attributes plus the actual 
vote among 1,051 voters in the 2000 U.S. election. We showed how the hybrid 
CHAID provides a unique segmentation. We showed how it compares with 
a segmentation obtained using the traditional CHAID algorithm for a single 
dependent variable - VOTE. The results suggest that the segments resulting
from the hybrid CHAID may predict any single dependent variable somewhat
less well than the segments of the original algorithm, but make up for this
by providing a single unique set of segments that are predictive of all
dependent variables.
Abstract. Several possibilities of defining the expectation of random
p-dimensional intervals are proposed. After defining the expectation via
reducing intervals to their extremal points, p-dimensional intervals
(rectangles) are treated as Random Closed Sets (RCSs). In this framework
Random Closed Rectangles (RCRs) are defined and the properties of different
definitions of expectations of RCSs, applied to RCRs, are studied. In
addition, known mean values of interval data are integrated into this
generalized approach.



1 Introduction 

Clustering methods often use class representatives or prototypes to describe 
data clusters. Prototypes are involved in many clustering criteria, where the 
dissimilarity between a data point and a cluster representative is considered. 
Moreover, the properties of a cluster are often characterised briefly by one 
single data point, e.g. the class centroid. 

There are several clustering methods for processing p-dimensional interval
data $x_1, \ldots, x_n$ with

$$x_i = [a_i, b_i], \quad a_i \le b_i \in \mathbb{R}^p \quad (\text{that means } a_{ij} \le b_{ij} \ \forall i, j),$$

$$x_i = [a_{i,1}, b_{i,1}] \times \cdots \times [a_{i,p}, b_{i,p}], \quad i = 1, \ldots, n.$$

For instance these data could be daily meteorological data (atmospheric
pressure, temperature, air humidity, etc.) of a certain city, or medical data
(blood pressure, temperature, ...) of different patients. These data consist
of many different measurements which are contained in an interval.

When processing p-dimensional interval data with a certain clustering method
one is correspondingly searching for the mean of intervals as the
representative of all interval data in a class. This leads to the question of
how the mean of some p-dimensional intervals is defined. We will introduce two
different ways of defining a mean of (p-dimensional) intervals, which can also
be regarded as hyper-rectangles. First we reduce an interval to its minimum
and its maximum value to shift the problem to the case of real-valued data,
where the definition of expectation and mean is well known. In the second
approach we treat an interval as a special form of closed (convex) set and use
the theory of Random Closed Sets (RCSs) to define the mean via several
different definitions of expectation.
2 Reduction to characteristic points

In this section a very obvious possibility to average a set of p-dimensional
rectangles is studied. Just as circles can be characterised by their midpoint
and their radius alone, rectangles can also be reduced to a few points. Hence,
we want to use the lower left vertex and the upper right vertex to
characterise a subset of $\mathbb{R}^p$ by only two p-dimensional real-valued
vectors. To put this approach in a more formal framework we use the
transformations $t, t^{-1}$ to switch between rectangles and pairs of
p-dimensional vectors. On
$\mathcal{Q} := \{Q \subset \mathbb{R}^p \mid Q = [a, b],\ a \le b \in \mathbb{R}^p\}$
we define

$$t : \mathcal{Q} \to \mathbb{R}^p \times \mathbb{R}^p, \qquad t(Q) = t([a, b]) := (a, b) \quad \forall Q \in \mathcal{Q}, \tag{1}$$

$$t^{-1} : \{(a, b) \in \mathbb{R}^p \times \mathbb{R}^p \mid a \le b\} \to \mathcal{Q}, \qquad t^{-1}((a, b)) := [a, b] \quad \forall (a, b) \in \{(a, b) \in \mathbb{R}^p \times \mathbb{R}^p \mid a \le b\}. \tag{2}$$

Definition 1. Let $(\Omega, \mathfrak{A}, P)$ be a probability space and
$X = (A, B) : (\Omega, \mathfrak{A}) \to (\mathbb{R}^p \times \mathbb{R}^p, \mathfrak{B}^{2p})$
a random variable which satisfies

$$A \le B \quad \text{a.s.} \tag{3}$$

Then we call $X$ a Random Point Rectangle (RPR).

Definition 2. Let $X = (A, B)$ be a Random Point Rectangle with the property
that $A, B$ are integrable. Then the expectation of $X$ is defined as

$$E[X] := (E[A], E[B]). \tag{4}$$

Remark 1. Treating a p-dimensional interval as a pair of points gives us the
ability to obtain a definition for the mean of a finite set of rectangles.

Let $\{Q_1, \ldots, Q_n\}$ be a set of p-dimensional intervals and $M(\cdot)$
the empirical mean (corresponding to Definition 2 of expectation) of a finite
set of p-dimensional RPRs. Then we obtain

$$M := t^{-1}(M(t(Q_1), \ldots, t(Q_n))) \tag{5}$$

as a mean of p-dimensional intervals. This result coincides with the intuitive
way to build the mean of finitely many rectangles, and this agreement follows
from the equally intuitive construction of $E$. Later, in Section 4, we will
get the same mean from a more general construction of expectation.
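A minimal sketch of the mean in (5), assuming each rectangle is stored as a pair (a, b) of its lower left and upper right vertices; the function name and the example rectangles are hypothetical.

```python
import numpy as np

def rectangle_mean(rects):
    """Mean of p-dimensional rectangles via their characteristic points.

    rects: list of (a, b) pairs of length-p sequences with a <= b, i.e. the
           images t(Q_i) of the rectangles. Returns the pair (mean a, mean b),
           which t^{-1} maps back to the mean rectangle.
    """
    lower = np.mean([a for a, _ in rects], axis=0)
    upper = np.mean([b for _, b in rects], axis=0)
    return lower, upper

# Two hypothetical rectangles in the plane.
rects = [([0.0, 0.0], [2.0, 1.0]), ([1.0, 2.0], [3.0, 5.0])]
lower, upper = rectangle_mean(rects)
```

Since every rectangle satisfies a <= b componentwise, the averaged vertices satisfy the same inequality, so the result is again a valid rectangle.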



3 Several expectations of Random Closed Sets 

Bearing in mind that the treated rectangles may originate from values which
are to be represented by their complete range and not only by their extremal
points, we now treat the rectangles as ’real’ subsets of $\mathbb{R}^p$.
Accordingly, one considers p-dimensional intervals as realisations of
set-valued random variables. We briefly introduce the more general theory of
Random Closed Sets (RCSs) based on Matheron (Matheron (1975), Stoyan and Mecke
(1983)) to provide a basis for several definitions of expectations.

Let $\mathcal{F}$ be the system of all closed subsets in $\mathbb{R}^p$ and
$\mathcal{K}$ the system of all compact subsets. Then we consider the
$\sigma$-algebra $\mathfrak{S}$ on $\mathcal{F}$ which contains for all
$K \in \mathcal{K}$:

$$\mathcal{F}_K := \{F \in \mathcal{F} \mid F \cap K \neq \emptyset\}. \tag{6}$$

Definition 3. A Random Closed Set (RCS) is a random variable $X$ with
values in $(\mathcal{F}, \mathfrak{S})$.

$X$ is called convex, if $X$ is convex almost surely.

$X$ is called a Random Closed Rectangle (RCR), if $X$ is a closed rectangle
almost surely.

Remark 2. The distribution $P_X$ of a RCS $X$ is determined by knowing
$P_X(\mathcal{F}_K)$ for all $K \in \mathcal{K}$.

3.1 The Aumann expectation 

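Definition 4. Let $(\Omega, \mathfrak{A}, P)$ be a probability space and
$X : (\Omega, \mathfrak{A}) \to (\mathcal{F}, \mathfrak{S})$ a RCS. A random
point $\phi : (\Omega, \mathfrak{A}) \to (\mathbb{R}^p, \mathfrak{B}^p)$ is
called selection of $X$, if

$$\phi(\omega) \in X(\omega) \quad \text{a.s.} \tag{7}$$

Definition 5. Let $(\Omega, \mathfrak{A}, P)$ be a probability space,
$X : (\Omega, \mathfrak{A}) \to (\mathcal{F}, \mathfrak{S})$ a RCS and
$\Phi_X$ the set of all selections of $X$. Then

$$E_A[X] := \{E[\phi] \mid \phi \in \Phi_X\} \tag{8}$$

is called the Aumann expectation of $X$.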

3.2 The Frechet expectation 

Definition 6. For closed $A, B \subset \mathbb{R}^p$ and with
$d_e : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}_+$ the Euclidean
distance we can define the Hausdorff distance between $A$ and $B$ as:

$$d(A, B) := \max\left\{ \max_{x \in A} \min_{y \in B} d_e(x, y),\ \max_{y \in B} \min_{x \in A} d_e(x, y) \right\}. \tag{9}$$

Let $X$ be a Random Closed Set. Then the solution $K_0$ of

$$E[d(X, K_0)^2] = \min_{K \in \mathcal{K}} E[d(X, K)^2] \tag{10}$$

is called the Frechet expectation $E_F[X]$ of $X$.
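The max-min structure of the Hausdorff distance in (9) can be made concrete with a sketch for finite point sets. Restricting to finite sets is a simplifying assumption (the definition covers arbitrary closed sets), and the function name is hypothetical.

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between two finite point sets in R^p.

    A, B: arrays of shape (n_A, p) and (n_B, p), one point per row.
    """
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    # pairwise[i, j] is the Euclidean distance between A[i] and B[j].
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    a_to_b = pairwise.min(axis=1).max()   # max over A of min over B
    b_to_a = pairwise.min(axis=0).max()   # max over B of min over A
    return float(max(a_to_b, b_to_a))

# Hypothetical example: the far point (4, 0) of B is 3 away from A.
distance = hausdorff([[0, 0], [1, 0]], [[0, 0], [4, 0]])
```

The two directed terms generally differ (here 1 and 3), which is why the definition takes their maximum.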
Remark 3. Although the Frechet expectation is in general very hard to
specify, because it is a solution (existence, uniqueness?) of a hard
optimisation problem, it can be generalised to a whole class of different
expectations, if the type of distance is changed. We can consider the Aumann
expectation as a special case (see Molchanov (1997)) and moreover, we are able
to choose special distances for rectangles such that the expectation of RCRs
has a closed solution (see Section 4).

The second advantage of this definition of expectation is the fact that we
obtain a formulation of a corresponding variance, too. If

$$V_F := \min_{K \in \mathcal{K}} E[d(X, K)^2] \tag{11}$$

is considered as a Frechet variance, it is possible to show that this object
shares several properties of the variance of a real-valued random variable.


3.3 The Doss expectation 

Like the kinds of expectation introduced above, the Doss expectation is also
defined via a distance measure.

Definition 7. Let $(\Omega, \mathfrak{A}, P)$ be a probability space and
$X : (\Omega, \mathfrak{A}) \to (\mathcal{F}, \mathfrak{S})$ a RCS. Consider
for an arbitrary $x \in \mathbb{R}^p$

$$M_x := \{y \in \mathbb{R}^p \mid d_e(x, y) \le E[d(x, X)]\}, \tag{12}$$

then

$$E_D[X] := \bigcap_{x \in \mathbb{R}^p} M_x = \{y \in \mathbb{R}^p \mid d_e(x, y) \le E[d(x, X)] \ \forall x \in \mathbb{R}^p\} \tag{13}$$

is called the Doss expectation of $X$.

Remark 4. In analogy to the Frechet expectation it is possible to vary the
distance measure. A whole family of different expectations of RCSs can be
obtained that way, but there are even more possibilities. Instead of using
the expectation $E[d(x, X)]$ in (12) it would be possible to use an arbitrary
functional on $\mathbb{R}^p$.

Theorem 1. The Doss expectation of a Random Closed Set is convex.
Proof. See Nordhoff (2003).

3.4 The Vorob’ev expectation 

In contrast to the other presented expectations the Vorob’ev expectation
is defined via the volume of a Random Closed Set. As the volume of the
boundary is zero, the boundary of the RCS does not play as important a role
as in the other definitions of expectation.

The Vorob’ev expectation uses the characteristic function of a RCS, viz. the
function $\mathbf{1}_X : \mathbb{R}^p \times \Omega \to \{0, 1\}$ with

$$\mathbf{1}_{X(\omega)}(x) = \begin{cases} 1, & \text{if } x \in X(\omega), \\ 0, & \text{else.} \end{cases} \tag{14}$$



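Definition 8. Let $(\Omega, \mathfrak{A}, P)$ be a probability space and
$X : (\Omega, \mathfrak{A}) \to (\mathcal{F}, \mathfrak{S})$ a RCS. Then the
function $p_X : \mathbb{R}^p \to [0, 1]$ with

$$p_X(x) := E[\mathbf{1}_X(x)] = \int_\Omega \mathbf{1}_{X(\omega)}(x)\, dP(\omega) = P(x \in X) \tag{15}$$

is called cover function of $X$.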
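For a finite number of equally likely closed rectangles the cover function reduces to the fraction of rectangles containing a point. The sketch below assumes exactly this uniform finite setting; the function name and the example rectangles are hypothetical.

```python
import numpy as np

def cover_function(points, lowers, uppers):
    """Empirical cover function p_X(x) for n equally likely rectangles.

    points: (q, p) array of evaluation points x.
    lowers: (n, p) array of lower left vertices a_i.
    uppers: (n, p) array of upper right vertices b_i.
    Returns a length-q array with the fraction of rectangles covering each x.
    """
    points = np.asarray(points, dtype=float)[:, None, :]
    lowers = np.asarray(lowers, dtype=float)[None, :, :]
    uppers = np.asarray(uppers, dtype=float)[None, :, :]
    # inside[q, i] is the indicator 1_{X(omega_i)}(x_q) of equation (14).
    inside = np.all((lowers <= points) & (points <= uppers), axis=2)
    return inside.mean(axis=1)

# Two hypothetical overlapping squares [0, 2]^2 and [1, 3]^2.
lowers = [[0.0, 0.0], [1.0, 1.0]]
uppers = [[2.0, 2.0], [3.0, 3.0]]
coverage = cover_function([[0.5, 0.5], [1.5, 1.5], [5.0, 5.0]], lowers, uppers)
```

Points in the overlap are covered with probability one, points in only one square with probability one half, and points outside both with probability zero.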

Definition 9. Let $(\Omega, \mathfrak{A}, P)$ be a probability space,
$X : (\Omega, \mathfrak{A}) \to (\mathcal{F}, \mathfrak{S})$ a RCS and
$p_X : \mathbb{R}^p \to [0, 1]$ the cover function of $X$. For an arbitrary
$q \in [0, 1]$ and the Lebesgue measure $\lambda^p$ in $\mathbb{R}^p$ one
considers

$$L_q(X) := \{x \in \mathbb{R}^p \mid p_X(x) \ge q\} \quad \text{and} \tag{16}$$

$$q_0 := \inf\{q \in [0, 1] \mid \lambda^p(L_q(X)) \le E[\lambda^p(X)]\}. \tag{17}$$

Then the Vorob’ev expectation $E_V[X]$ of $X$ is defined by

$$E_V[X] := L_{q_0}. \tag{18}$$

Remark 5. The set $L_{1/2}(X)$ of a compact RCS $X$ is often named Median (see
Stoyan and Stoyan (1994)), because it minimises the expected volume of the
symmetric difference between $X$ and a Borel set.



4 Expectations of Random Closed Rectangles 

After introducing several definitions of expectations for RCSs one is
interested in the behaviour of these expectations, if the underlying sets are
Random Closed Rectangles.

4.1 The Aumann expectation 

A very pleasant property of the Aumann expectation and the resulting mean
for a finite number of fixed closed rectangles is their easy evaluation. More
precisely, they coincide with the expectation/mean of Definition 2 and Remark
1.

Theorem 2. Let $(\Omega, \mathfrak{A}, P)$ be a probability space,
$X : (\Omega, \mathfrak{A}) \to (\mathcal{F}, \mathfrak{S})$ a RCR and the
functions $t, t^{-1}$ defined as in (1) and (2). If the expectation
$E[t(X)]$ exists, the Aumann expectation satisfies

$$E_A[X] = t^{-1}(E[t(X)]). \tag{19}$$
Proof. Let $(\Omega, \mathfrak{A}, P)$ be a probability space, $\mathcal{F}$
the system of closed sets in $\mathbb{R}^p$ and let the RCR
$X : (\Omega, \mathfrak{A}) \to (\mathcal{F}, \mathfrak{S})$ satisfy

$$X(\omega) = [A(\omega), B(\omega)] = [A_1(\omega), B_1(\omega)] \times [A_2(\omega), B_2(\omega)] \times \cdots \times [A_p(\omega), B_p(\omega)]$$

with $A(\omega) \le B(\omega) \in \mathbb{R}^p$ a.s.

For readability we drop the transformations $t, t^{-1}$. That means, we write
$E[X]$ instead of $t^{-1}(E[t(X)])$ in this proof, although the expectation
$E[\cdot]$ is defined on the set of Random Point Rectangles.

As in Section 2, we have

$$E[X] := [E[A], E[B]] = [E[A_1], E[B_1]] \times \cdots \times [E[A_p], E[B_p]]. \tag{20}$$

We want to show: $E_A[X] = E[X]$.

C: We consider x £ Ea{X], then there is a selection </> : (17,21) — > (1R P , < B P ) 
of X with x = E[(j>] and <j>{ui) £ X(uj) a.s. Thus 



E[A] = f A(u>) dP(u > ) < f 4>(u>) dP{u>) = x (21) 

J n J n 

< [ B(u>) dP{u) = E[B ], 

J a 

because A(ui) < < B{u>) a.s. It follows from (21) that 

x£E[X}. (22) 



D : Consider now x £ E[X], then there are real-valued 0 < < 1 

satisfying 



/ t\E[A{\ + (1 — ti)E[Bi\ \ 


= E 


‘ /tiAiV 


+ E 




/(I 


\ t p E[Ap\ + (1 — tp)E[Bp\ ) 




\ tpA p J 






V (1 — tp)B p / _ 



We choose now <j> : (17, 21) — » (1R P , 25 p ) with 






( t\A\(uf) 

\ tpAp(uj) 



/(l-tOPiMX 

: Vw G 17. 

\ (1 — tp)B p (io) J 



Then <j> is a selection of X, due to </>(w) £ X(u>) Vw G 17 and we obtain 
x = E[(j>]. So we can conclude that x £ Ea[X). 



With the aid of Theorem 2 the canonical definition of the Aumann mean
can be replaced by a simple representation if the underlying objects are
rectangles.

Proposition 1. Let $Q_1, \ldots, Q_n$ be fixed closed p-dimensional intervals. To
get a definition of ’mean’ via the Aumann expectation we construct a finite
probability space $(\Omega, \mathfrak{A}, P)$ with $\Omega = \{\omega_1, \ldots, \omega_n\}$, $\mathfrak{A} = \mathfrak{P}(\Omega)$ and $P$ the
uniform distribution on $\Omega$. By defining a Random Closed Rectangle $X$ with
$X(\omega_i) := Q_i$, $i = 1, \ldots, n$, we get the Aumann mean $M_A(Q_1, \ldots, Q_n) := E_A[X]$,
and taking note of Theorem 2 we obtain

$$M_A(Q_1, \ldots, Q_n) = M(Q_1, \ldots, Q_n). \qquad (23)$$

4.2 The Fréchet expectation

As proposed in Remark 3 it is possible to define various kinds of expectation
and mean if we consider several distance measures between sets. In the case of
p-dimensional intervals there are several distance measures between intervals
(for example see Chavent (2000)).

Example 1. Now, we take a look at the distance $d : \mathcal{Q} \times \mathcal{Q} \to \mathbb{R}_+$, satisfying

$$\ldots, \qquad a \le b,\ a' \le b' \in \mathbb{R}^p. \qquad (24)$$

Looking for a closed form of the Fréchet expectation of RCRs with respect
to this special kind of distance measure, one can show easily (see Nordhoff
(2003)) that this expectation coincides with the expectation which is defined
in Definition 2.

In the case of a finite number of rectangles the optimisation problem which
is connected to the Fréchet expectation is treated in Chavent and Lechevallier
(2002) for a special form of the Hausdorff distance.

4.3 The Doss expectation

Like the Aumann expectation and the version of the Fréchet expectation in
Example 1, the Doss expectation of Random Closed Rectangles coincides, under
the assumption of uniform boundedness, with the expectation which is defined
in Def. 2 (for details see Nordhoff (2003)). Therefore the Doss mean of a
finite number of closed rectangles, which is defined in a canonical way like the
Aumann mean (via constructing a uniformly bounded RCR), agrees
with the empirical mean of the corresponding RPRs.
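Because the Aumann, Fréchet (Example 1) and Doss means all reduce to the componentwise mean of (20) under uniform weights, they can be evaluated by averaging the lower and the upper corners of the rectangles. The following Python sketch illustrates this; the function name and the representation of a rectangle as a pair (lower corner, upper corner) are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

# A rectangle is stored as (lower corner a, upper corner b); this layout is an assumption.
Rect = Tuple[Sequence[float], Sequence[float]]

def mean_rectangle(rects: List[Rect]) -> Tuple[List[float], List[float]]:
    """Componentwise mean of lower and upper corners (uniform weights)."""
    n = len(rects)
    p = len(rects[0][0])
    lower = [sum(r[0][k] for r in rects) / n for k in range(p)]
    upper = [sum(r[1][k] for r in rects) / n for k in range(p)]
    return lower, upper

q1 = ([1.0, 1.0], [2.0, 2.0])   # [1,2] x [1,2]
q2 = ([3.0, 1.0], [6.0, 8.0])   # [3,6] x [1,8]
m = mean_rectangle([q1, q2])    # the rectangle [2,4] x [1,5]
```

The result is again a rectangle, which is the shape-preserving property discussed in this section.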



4.4 The Vorob’ev expectation 

The Vorob’ev expectation depends on the volume of the RCR and therefore
does not preserve the shape, as the following simple example shows.

Example 2. In the case of p = 2 let $X$ be the (convex) RCR with

$$X(\omega) = \begin{cases} Q_1 := [1,2] \times [1,2], & \text{with probability } \ldots, \\ Q_2 := [3,6] \times [1,8], & \text{with probability } \ldots. \end{cases} \qquad (25)$$

Then the Vorob’ev expectation of $X$ is $E_V[X] = Q_1 \cup Q_2$; in particular it is
neither a rectangle nor convex.



Additionally, in the case of simple probability spaces the ’approximation of
the expected volume’ often fails. So this kind of expectation is not suited to
build a mean of random rectangles.
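The behaviour of the Vorob’ev expectation for a random rectangle with finitely many values can be checked numerically. The sketch below evaluates (15) to (18) exactly on the grid induced by the rectangle corners (p = 2). The probabilities 1/2 and 1/2 used in the example call are an assumption for illustration, and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

def vorobev_expectation(rects, probs):
    """Vorob'ev expectation of a random rectangle taking finitely many values (p = 2).

    rects: list of ((x_lo, x_hi), (y_lo, y_hi)); probs: matching probabilities.
    Returns (q0, cells of L_q0, area of L_q0, expected area). The degenerate
    case q0 = 0 (where L_q0 is the whole plane) is not treated specially.
    """
    xs = sorted({v for (x, _) in rects for v in x})
    ys = sorted({v for (_, y) in rects for v in y})
    cells = []  # (cover probability, area, cell); the cover function is constant per cell
    for (x0, x1), (y0, y1) in product(zip(xs, xs[1:]), zip(ys, ys[1:])):
        cx, cy = Fraction(x0 + x1) / 2, Fraction(y0 + y1) / 2   # cell midpoint
        cover = sum((p for ((a0, a1), (b0, b1)), p in zip(rects, probs)
                     if a0 <= cx <= a1 and b0 <= cy <= b1), Fraction(0))
        cells.append((cover, (x1 - x0) * (y1 - y0), ((x0, x1), (y0, y1))))
    expected_area = sum(p * (a1 - a0) * (b1 - b0)
                        for ((a0, a1), (b0, b1)), p in zip(rects, probs))

    def area_above(c):
        # area of {p_X > c}: the value of area(L_q) for q slightly larger than c
        return sum(area for cover, area, _ in cells if cover > c)

    levels = sorted({Fraction(0)} | {cover for cover, _, _ in cells})
    q0 = next(c for c in levels if area_above(c) <= expected_area)
    region = [cell for cover, _, cell in cells if cover >= q0]
    region_area = sum(area for cover, area, _ in cells if cover >= q0)
    return q0, region, region_area, expected_area

half = Fraction(1, 2)   # assumed probabilities, for illustration only
q0, region, region_area, expected_area = vorobev_expectation(
    [((1, 2), (1, 2)), ((3, 6), (1, 8))], [half, half])
```

Under this assumption the level set at $q_0$ is the union of both rectangles, whose area is twice the expected area, which illustrates why the approximation of the expected volume can fail on simple probability spaces.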



5 Discussion 

We have considered the problem of building a mean of p-dimensional rectangles
in a more general framework. With the aid of Random Closed Sets the
intuitive way of building the mean can be embedded as a special case. This
fact legitimizes the intuitive approach and in some cases it yields a closed
form for expectations of Random Closed Sets.

Further approaches to defining an expectation for RCSs, and thus a mean of
p-dimensional intervals, are conceivable, but it has to be analysed whether the
resulting mean has reasonable properties. The different kinds of expectations
considered in this paper have shown that only those expectations of RCSs
which take the shape of the sets into account seem useful for RCRs.
Furthermore, in this paper we always use the empirical mean as a standard
estimator for the expectation of RCSs. But taking other statistical models
(Stoyan and Mecke (1983)) into account one could use an estimated distribution
of RCRs to build the mean of intervals.

Abstract. This paper analyses the influence of 13 stylized facts of the German
economy on the West German business cycles from 1955 to 1994. The method
used in this investigation is Statistical Experimental Design with orthogonal
factors. We search for all Plackett-Burman designs that can be realized by coded
observations of these data. The plans are then analysed by regression with forward
selection and various classification methods to extract the relevant variables for
separating upswing and downswing of the cycles. The results are compared with
existing studies on this topic.



1 Introduction 

In the following, existing data are analysed using the method of statistical 
experimental design. The aim of experimental design is to estimate factor 
effects with the highest accuracy possible. Usually, an experimental design 
with fixed factor levels is taken and the response of the experiment is used 
to find factors of high influence with as few experiments as possible. Thus 
the optimal factors determining the response are found faster and with less 
expense than by carrying out all experiments with all possible factor level 
combinations. In order to detect the variables which do influence the up- and 
downswing phases of the economy, we use a special type of screening plans, 
namely Plackett-Burman plans. Contrary to the method of full factorial de- 
signs, which investigate main effects and all possible interactions, these plans 
are employed to find only the main effects in the model. 

The original data used here are highly correlated. In order to eliminate
these correlations, the data are coded by -1 and +1 only and then special
observations are selected which form Plackett-Burman plans. The main advantage
of this method is that it selects the most important factors undisturbed by
correlations in the data. This procedure reduces the data in two ways: by the
discrete coding by -1 and +1 and by choosing special observations only. In
order to compensate for this at least partially, we analyse all existing
Plackett-Burman plans with respect to the data and finally choose those
variables which are, as we call it, uniquely correlated to the up and down
phases of the economy. The following investigations are based on 13 stylized
facts of the West German economy (cf. Heilemann and Munch (1999)) which have
been selected by Heilemann and Munch to explain the German business cycle.
There already exist a number of papers which analyse and interpret these data
based on, e.g., classification methods like linear discriminant analysis and
time series analysis (cp. Heilemann and Munch (1999), Weihs, Rohl and Theis
(1999), Weihs and Garczarek (2002)).

* This work has been supported by the Deutsche Forschungsgemeinschaft,
Sonderforschungsbereich 475. We also thank Uwe Ligges and Karsten Luebke for
their support.

In this paper in the first step, we code the data to -1 and +1 and in the 
second step we look for all Plackett-Burman plans in the coded data. All 
these plans are analysed by stepwise regression with forward selection, by 
unpruned classification trees, by trees consisting only of the tree stump and 
by stepwise linear discriminant analysis (cp. Rover (2003)). All this is based 
on an a priori classification of the response in the phases ‘up’, and ‘down’ 
in the years under investigation, based on Heilemann and Munch (1999). 
Finally, the variables which have turned out to be important are compared 
with the results of existing studies. 

2 Data 

The predictor data set consists of 13 variables which have been measured
quarterly (157 quarters) in the years 1955/4 to 1994/4 (price index base
is 1991) (cf. Heilemann and Munch (1999)). The variables (and their
abbreviations) are real-gross-national-product-gr (BSP91JW),
real-private-consumption-gr (CP91JW), government-deficit-rate (DEFRATE),
wage-and-salary-earners-gr (EWAJW), net-export-rate (EXIMRATE),
money-supply-M1-gr (GM1JW), real-investment-in-equipment-gr (IAU91JW),
real-investment-in-construction-gr (IB91JW), unit-labour-cost-gr (LSTKJW),
GNP-price-deflator-gr (PBSPJW), consumer-price-index-gr (PCPJW),
nominal-short-term-interest-rate (ZINSK), real-long-term-interest-rate
(ZINSLR). The letters ‘gr’ are an abbreviation of ‘growth rates relative to
last year’s corresponding quarter’.

3 Plackett-Burman designs 

Heilemann and Munch (1999) distinguish 4 phases of the business cycle:
‘upswing’, ‘upper turning point’, ‘downswing’ and ‘lower turning point’. Each
quarter has been assigned one of these phases, which we assume to be the
correct one. Here only the phases ‘upswing’ and ‘downswing’ are considered.
Therefore, the phases ‘upper turning point’ and ‘lower turning point’ are
split in the middle, i.e. if, e.g., the ‘upper turning point’ phase lasts for k
quarters, $k \in \mathbb{N}$, $[k/2]$ quarters are added to the ‘upswing’ phase and
$k - [k/2]$ quarters are added to the succeeding ‘downswing’ phase, where
$[x]$ denotes the so-called Gauss brackets, i.e. the largest integer less than or
equal to $x$, $x \in \mathbb{R}$. An analogous convention holds for the ‘lower turning point’
phase. The two phases ‘upswing’ and ‘downswing’ are coded by 0 and 1,
respectively. Note that a two-phase consideration is standard in business cycle
analysis. Thus, it is the natural starting point for our studies. Extensions to
4 classes are planned.
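The splitting rule for the turning-point phases can be sketched in a few lines of Python. The phase labels used here are hypothetical, and the treatment of the ‘lower turning point’ (first $[k/2]$ quarters to the preceding downswing) is an assumption about what "analogous" means.

```python
from itertools import groupby

def to_two_phases(phases):
    """Map four-phase labels to 0 (upswing) / 1 (downswing).

    A run of k 'upper_tp' quarters gives its first floor(k/2) quarters to the
    upswing and the remaining ones to the downswing; 'lower_tp' is handled
    analogously with the roles exchanged (an assumption).
    """
    coded = []
    for label, run in groupby(phases):
        k = len(list(run))
        if label == "up":
            coded += [0] * k
        elif label == "down":
            coded += [1] * k
        elif label == "upper_tp":      # upswing precedes, downswing follows
            coded += [0] * (k // 2) + [1] * (k - k // 2)
        elif label == "lower_tp":      # downswing precedes, upswing follows
            coded += [1] * (k // 2) + [0] * (k - k // 2)
        else:
            raise ValueError(f"unknown phase label: {label!r}")
    return coded

quarters = ["up"] * 3 + ["upper_tp"] * 3 + ["down"] * 2 + ["lower_tp"] * 2 + ["up"]
coded = to_two_phases(quarters)
```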

Plackett-Burman plans only exist if the number of experiments n is a 
multiple of four and the number of variables is n-1 (cf. Plackett and Burman 
(1946), Weihs and Jessenberger (1999)). The Plackett-Burman plan for n = 8 
is shown in Table 1. 



Table 1. Plackett-Burman plan with 8 experiments.

The second row is called the generating row, as it generates rows 3-8 of
the matrix by being shifted one position to the right at each step. Plackett-
Burman plans are orthogonal arrays in the sense of Hedayat et al. (1999):
they are of the form $OA(4\lambda, 4\lambda - 1, 2, 2)$, $\lambda \in \mathbb{N}$ ($\lambda = 2$ in Table 1), i.e.
each factor has only the two levels -1 and +1, the sum of each column is 0 and
the columns are pairwise orthogonal. If an 8th column consisting only of +1’s
is added to the matrix, one gets the (up to isomorphism) unique Hadamard
matrix of order 8 (cp. Hedayat et al. (1999)). In order to look for
Plackett-Burman designs in the existing data, it is therefore necessary to code
the data as +1 and -1. For each variable, all values less than its median are
taken as -1 and all values greater than or equal to its median are taken as +1.
As there are 13 variables, one looks for Plackett-Burman plans with n = 8 or
n = 12 in the coded data. 113 different plans were found for n = 8 and none
for n = 12.
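The structure of an 8-run plan and the median coding can be illustrated as follows. The sketch assumes the standard generator (+ + + - + - -) for n = 8 and an all -1 first row; the generating row of Table 1 itself is not reproduced in this text, so this choice is an assumption. Function names are hypothetical.

```python
def plackett_burman_8(generator=(1, 1, 1, -1, 1, -1, -1)):
    """8-run plan: a row of -1's followed by the 7 cyclic right shifts of the generator."""
    rows = [[-1] * 7]
    row = list(generator)
    for _ in range(7):
        rows.append(row)
        row = [row[-1]] + row[:-1]      # shift one position to the right
    return rows

def median_code(values):
    """Code values below the median as -1 and all other values as +1."""
    s = sorted(values)
    n = len(s)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    return [-1 if v < median else 1 for v in values]

plan = plackett_burman_8()
columns = list(zip(*plan))
column_sums = [sum(col) for col in columns]
inner_products = [sum(a * b for a, b in zip(columns[i], columns[j]))
                  for i in range(7) for j in range(i + 1, 7)]
```

All column sums and all pairwise inner products of the columns are zero, which is the orthogonality property stated above.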

The algorithm for finding these plans first looks for all rows which
contain the number -1 at least seven times. The corresponding columns are
then searched for the generating row. After this has been found, the search
continues for the generating row shifted one position to the right, etc. This
process has to be carried out for all possible permutations of the original seven
columns. A much faster algorithm has been suggested by S. Haustein (private
communication), where one looks for the base row $u_0 = (\ldots)$
and then searches for a row $v$ in the corresponding columns with Hamming
distance 4 to $u_0$. After this has been found, one looks for a row $v_1$ with
Hamming distance 4 to $u_0$ and $v$. This process is continued until eight rows
have been found which are equidistant with Hamming distance 4. These eight
rows form a Plackett-Burman plan for n = 8, because the Plackett-Burman
plan for n = 8 is an orthogonal array of the form $OA(8, 8 - 1, 2, 2)$ and
there is only one isomorphism class of such arrays. Here two arrays are said
to be isomorphic (cf. Hedayat et al. (1999)) if one can be obtained from the
other by permutations of rows, columns or factor levels.
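The equidistance search can be sketched as a depth-first search over rows of the coded data. This is an illustrative reading of the procedure, not the original implementation: the base row is assumed to consist of -1 in all seven chosen columns, and the function names are hypothetical.

```python
from itertools import combinations

def hamming(u, v):
    """Number of positions in which u and v differ."""
    return sum(a != b for a, b in zip(u, v))

def find_plans(data, n_factors=7, n_runs=8, dist=4):
    """Return (columns, row indices) of equidistant row sets in -1/+1 coded data.

    For every choice of n_factors columns, each row equal to -1 in all chosen
    columns serves as base row (an assumption); rows are then added as long as
    they keep Hamming distance `dist` to all rows chosen so far.
    """
    plans = []
    for cols in combinations(range(len(data[0])), n_factors):
        proj = [tuple(row[c] for c in cols) for row in data]
        bases = [i for i, r in enumerate(proj) if all(x == -1 for x in r)]
        for base in bases:
            # candidate rows: those at distance `dist` from the base row
            cand = [i for i, r in enumerate(proj) if hamming(r, proj[base]) == dist]

            def extend(chosen, start):
                if len(chosen) == n_runs:
                    plans.append((cols, tuple(chosen)))
                    return
                for pos in range(start, len(cand)):
                    i = cand[pos]
                    if all(hamming(proj[i], proj[j]) == dist for j in chosen[1:]):
                        extend(chosen + [i], pos + 1)

            extend([base], 0)
    return plans

# Toy data: an 8-run plan built from an assumed generator, plus one unrelated row.
gen = [1, 1, 1, -1, 1, -1, -1]
toy = [[-1] * 7] + [gen[-k:] + gen[:-k] for k in range(7)] + [[1] * 7]
found = find_plans(toy)
```

Eight vectors of length 7 with pairwise Hamming distance 4 have pairwise inner product -1, so appending a column of +1's makes the rows orthogonal; this is why equidistance suffices.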

In the following investigations, a linear screening model is used, $y = X\beta + e$,
where $X = (\mathbf{1}, A)$ is an $(n \times n)$ matrix with $\mathbf{1} = (1, 1, 1, \ldots)^T$ and $A$ the
Plackett-Burman matrix. $\beta$ is the vector of unknown coefficients, $y$ the result
vector with the coded business cycle phases and $e$ the error vector.
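Because the columns of $X = (\mathbf{1}, A)$ are pairwise orthogonal with entries -1 and +1, $X^T X = nI$ holds and the least-squares estimate reduces to $\hat\beta = X^T y / n$. A minimal sketch follows; the function name, the assumed generator of the toy plan and the toy response are illustrative.

```python
def screening_coefficients(A, y):
    """Least-squares coefficients of y = X beta + e with X = (1, A).

    Assumes pairwise orthogonal -1/+1 columns, so that X'X = n I and
    beta_hat = X'y / n.
    """
    n = len(A)
    X = [[1] + list(row) for row in A]
    return [sum(X[i][j] * y[i] for i in range(n)) / n for j in range(len(X[0]))]

# Toy plan from an assumed generator; the response depends on the third factor only.
gen = [1, 1, 1, -1, 1, -1, -1]
A = [[-1] * 7] + [gen[-k:] + gen[:-k] for k in range(7)]
y = [0.5 + 0.5 * row[2] for row in A]      # phases coded 0 / 1
beta = screening_coefficients(A, y)
```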

4 Results 

4.1 Stepwise regression by forward selection 

113 different Plackett-Burman plans were found by the method described in
Section 3. When evaluating these plans by stepwise regression with forward
selection with respect to y (cp. Weihs and Jessenberger (1999)), we used the
F-test at level 0.2. Figures 1 and 2 show the absolute and the relative frequency
of the selected variables (dark bars). The light bars show how often each
variable appears in all 113 Plackett-Burman plans. Figure 1 thus shows that
each variable appears in at least one plan (light bars). The variables which turn
out to be most important by this method are ‘DEFRATE’, ‘EXIMRATE’,
‘LSTKJW’, ‘IAU91JW’ and ‘ZINSK’ (cp. Figure 2). If one uses the F-test
at level 0.05 one gets the same variables except ‘EXIMRATE’. It is also
interesting that in almost half of all cases none of the variables turns out
to be important. Furthermore it is striking that for all variables the dark bars
are rather small compared to the light ones. That means that although a
variable appears often in the plans, it is chosen only a few times as important
for the up- and downswings of the economy.
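For an orthogonal plan, forward selection with an F-to-enter test is particularly simple, because the sum of squares explained by each factor does not depend on the other factors in the model. The sketch below is an illustration under stated assumptions, not the procedure actually used: critical values at level 0.2 are taken as squared 0.9 quantiles of Student's t distribution (tabulated for 1 to 6 degrees of freedom, which covers n = 8), and the function name is hypothetical.

```python
# 0.9 quantiles of Student's t for 1..6 degrees of freedom; their squares are
# the 0.8 quantiles of F(1, df), i.e. the critical values at level 0.2.
T90 = {1: 3.078, 2: 1.886, 3: 1.638, 4: 1.533, 5: 1.476, 6: 1.440}

def forward_selection(A, y, crit=lambda df: T90[df] ** 2):
    """Forward selection on an orthogonal -1/+1 plan A with response y."""
    n, m = len(A), len(A[0])
    mean = sum(y) / n
    rss = sum((v - mean) ** 2 for v in y)     # residual sum of squares, intercept only
    # sum of squares explained by each factor (additive because of orthogonality)
    ss = [sum(A[i][j] * y[i] for i in range(n)) ** 2 / n for j in range(m)]
    selected = []
    for j in sorted(range(m), key=lambda j: -ss[j]):
        df = n - len(selected) - 2            # residual df after adding factor j
        if df < 1 or ss[j] <= 1e-12:
            break
        new_rss = max(rss - ss[j], 0.0)
        f_stat = float("inf") if new_rss <= 1e-12 else ss[j] / (new_rss / df)
        if f_stat <= crit(df):
            break
        selected.append(j)
        rss = new_rss
    return selected

# Toy plan from an assumed generator; only the third factor drives the response.
gen = [1, 1, 1, -1, 1, -1, -1]
A = [[-1] * 7] + [gen[-k:] + gen[:-k] for k in range(7)]
y = [0.5 + 0.5 * row[2] for row in A]
selected = forward_selection(A, y)
```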



4.2 Classification methods 

In the 113 plans, variables are also selected by different classification
methods, i.e. unpruned classification trees (TreeAllNodes), classification trees
consisting only of the tree stump (TreeStump) and stepwise linear discriminant
analysis (cp. Rover (2003)). Figures 3 and 4 again show the absolute and the
relative frequency of the variables selected by the different methods. The number
in brackets following the variable name indicates how often the variable
appears in a Plackett-Burman plan. Classification by unpruned trees yields
‘BSP91JW’, ‘CP91JW’, ‘DEFRATE’ and ‘EXIMRATE’ as important variables.
Using only the tree stump yields the same variables except ‘CP91JW’.
This is the same result one gets by stepwise linear discriminant
analysis. On the whole, these three classification methods yield similar results
but on different levels.

Fig. 1. Absolute frequency of variables selected by stepwise regression with forward
selection.

Fig. 2. Relative frequency of variables selected by stepwise regression with forward
selection.

For all classification methods used as well as for stepwise regression with
forward selection it is important to know how the rows which build the
Plackett-Burman plans are distributed. This is illustrated in Figure 5, which
shows how often each row is contained in a plan. Note that the outstanding
row number 72 refers to the 4th quarter of 1972 and row number 145 to the
1st quarter of 1991. These are special years from an economic point of
view, as in 1972 the German economy suffered from the oil price shock. The
German unification influences the post-1990 data, an effect visible in the first
quarter of 1991.

4.3 Variable assessment 

If one wants to decide which of the above variables plays a dominant role
with respect to the business cycle, it is important to assess their correlation
with y in all those Plackett-Burman plans in which the corresponding variable
was included. It turns out (see Table 2) that unit labour costs (‘LSTKJW’) is
clearly positively correlated to y (84% of all cases) and the government deficit
(‘DEFRATE’) can still be considered as positively correlated (78% of all cases),
taking into account a possible error margin. No variable is clearly negatively
correlated to y. Hence, one may finally consider those variables as important
which on the one hand are chosen most often, both by regression and by
classification, and which on the other hand possess a distinct positive or
negative correlation to y. Using this decision criterion, one gets ‘unit labour
costs’ (‘LSTKJW’) and ‘government deficit’ (‘DEFRATE’) as variables which
clearly determine the West German business cycles.

Fig. 3. Absolute frequency of variables selected in Plackett-Burman design.

Fig. 4. Relative frequency of variables selected in Plackett-Burman design.

Fig. 5. Absolute frequency of rows in all Plackett-Burman plans.



Table 2. Correlation with respect to y.

In previous studies of this topic (cp. e.g. Weihs and Garczarek (2002),
Weihs et al. (1999)) the variables most influential for the West German
business cycle in the 4-phase case were ‘wage and salary earners’ (‘EWAJW’)
and ‘unit labour costs’ (‘LSTKJW’). Moreover, if one compares the above
method to stepwise regression by forward selection on the whole data set,
again taking level 0.2 in the F-test, the model ‘LSTKJW’ + ‘IAU91JW’ +
‘DEFRATE’ + ‘ZINSK’ + ‘CP91JW’ + ‘BSP91JW’ is chosen. This strongly
indicates the importance of ‘LSTKJW’ and ‘DEFRATE’. Stepwise linear
discriminant analysis, classification by unpruned trees and classification trees
using only the tree stump were also applied to the whole data set. Unpruned
classification trees show ‘IAU91JW’ to be the most important variable, as do
classification trees using only the tree stump. Stepwise linear discriminant
analysis shows that besides ‘IAU91JW’ two other variables are important,
‘LSTKJW’ and ‘PCPJW’.
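The correlation assessment of this section can be sketched as the share of plans in which a variable is positively correlated with y. How the percentages reported here were actually computed is not stated, so the following is one plausible reading; the data layout (a dictionary of coded columns per plan) and the function names are hypothetical.

```python
def correlation(x, y):
    """Pearson correlation; returns 0.0 if one of the vectors is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def share_positive(plans, variable):
    """Share of plans containing `variable` in which it is positively correlated with y.

    plans: list of (columns, y), where columns maps a variable name to its
    -1/+1 coded values in the plan and y holds the coded phases.
    """
    signs = [correlation(cols[variable], y) > 0
             for cols, y in plans if variable in cols]
    return sum(signs) / len(signs) if signs else None

toy_plans = [
    ({"LSTKJW": [1, 1, -1, -1]}, [1, 1, 0, 0]),
    ({"LSTKJW": [1, -1, 1, -1]}, [0, 1, 0, 1]),
    ({"DEFRATE": [1, -1, -1, 1]}, [1, 0, 0, 1]),
]
share = share_positive(toy_plans, "LSTKJW")
```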

5 Conclusion 

‘Unit labour costs’ (‘LSTKJW’) has been detected as an important variable
by this method as well as by previous methods (cp. 4.3). This strongly
indicates that this variable has a great influence on the West German business
cycle. The question why the ‘government deficit’ (‘DEFRATE’) turns out to
be important here, but not in previous studies, requires a thorough
analysis of the influence of the methods applied here on the results. The
advantage of using Plackett-Burman plans lies in the clean and easy selection
of the important variables. This is only a first step in this direction. Right
now, we are investigating only the correlations with the business cycle of those
variables which have turned out to be important in the investigations described
above. A next step could be to investigate a similar procedure with full
factorial designs or fractional factorial designs. These plans also respect
orthogonality, but in addition permit interactions between the factors.

Abstract. In this work we introduce a method for classification and visualization.
In contrast to simultaneous methods like, e.g., the Kohonen SOM, this new approach,
called KMC/EDAM, runs through two stages. In the first stage the data are clustered
by classical methods like K-means clustering. In the second stage the centroids of the
obtained clusters are visualized in a fixed target space which is directly comparable
to that of SOM.



1 Introduction 

In many applications a classification of the examined objects into groups
(clusters) which are heterogeneous among each other and homogeneous within is
desired. Many methods have been developed to solve this problem and are
subsumed under the terms classification methods and clustering methods.

In the context of clustered objects another problem often occurs. This
problem consists of the graphical representation, called visualization, of the
objects or classes, which are often represented by high-dimensional data
vectors, in a space of lower dimension. The requirement for such representations
is topology preservation, i.e. objects which are comparatively close in
the original space should also be close together in the representation space
and, correspondingly, pairs of distant objects should have large distances in
the visualization.

One method which can be interpreted both as a visualization and a
classification method is the so-called Kohonen Self-Organizing Map (SOM)
(Kohonen (1990)). SOM performs classification and visualization simultaneously.
Many alternatives to SOM have been proposed in the past. One example is
another simultaneous method suggested by Bock (1997). Bezdek and Pal
(1995) compare principal component analysis (PCA) and the
Sammon algorithm to SOM concerning topology preservation. They try to
avoid the problem of different solution spaces (with SOM, in contrast to the
latter methods, only a subset of the objects is visualized) by assigning to
each object an image in the neighborhood of the nearest visualized object.
Since this is done by random jittering, it is questionable whether the
corresponding results can still be seen as results generated by SOM. Hence the
result of Bezdek and Pal, namely that PCA and Sammon are superior to SOM, is
to be interpreted cautiously.

* This work has been supported by the Deutsche Forschungsgemeinschaft,
Sonderforschungsbereich 475.

Being aware of the aforementioned comparability problems we introduce
a new approach which carries out classification and visualization one after
the other. This approach consists of a combination of classical classification
methods (mainly K-Means-Clustering, KMC) and a new approach for the
visualization of the corresponding centroids. The latter is called
Eight-Directions-Arranged-Map (EDAM) and has a fixed representation space. This
solution space can be chosen in SOM as well. Under these conditions criteria
for classification and topology preservation can be defined and compared
between the two methods.

This paper starts with a description of the methods in section 2. Then
section 3 presents a few examples. The paper concludes with a summary
given in section 4.



2 Methods 

2.1 Preliminaries 

All following methods refer to a data matrix $X \in \mathbb{R}^{n \times k}$. Its rows $x_{1\cdot}, \ldots, x_{n\cdot} \in \mathbb{R}^k$
represent the data vectors of n corresponding objects and its columns
$x_{\cdot 1}, \ldots, x_{\cdot k} \in \mathbb{R}^n$ represent the measurement vectors of k corresponding
variables. Distances between two data vectors $x_{i\cdot}$ and $x_{j\cdot}$ are denoted by
$d(x_{i\cdot}, x_{j\cdot})$. We use the ordinary Euclidean distance in this paper.

A classification of X is a set of c clusters, where each object belongs to
exactly one cluster. A classification is denoted by a vector $\kappa \in \{1, \ldots, c\}^n$,
where the ith element $\kappa_i$ of $\kappa$ gives the cluster number of the ith object. A
common representative of cluster i is the so-called centroid $\mu_i \in \mathbb{R}^k$, which
is defined as:

$$\mu_i = (\mu_{i1}, \ldots, \mu_{ik})' \quad \text{with} \quad \mu_{ih} = \frac{1}{r_i} \sum_{j : \kappa_j = i} x_{jh}, \quad h = 1, \ldots, k, \qquad r_i = \#\{j : \kappa_j = i\}, \quad i = 1, \ldots, c. \qquad (1)$$

All centroids are compiled in the centroid matrix $M = (\mu_{ij})_{1 \le i \le c,\ 1 \le j \le k}$.
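Definition (1) translates directly into code. The following Python sketch computes the centroid matrix and the cluster sizes $r_i$ from the data rows and the classification vector; the function name is hypothetical and empty clusters are returned as None, which is a choice made for this illustration.

```python
def centroids(X, kappa, c):
    """Centroid matrix (c rows, k columns) and cluster sizes for labels kappa in 1..c."""
    k = len(X[0])
    sums = [[0.0] * k for _ in range(c)]
    sizes = [0] * c
    for row, label in zip(X, kappa):
        sizes[label - 1] += 1
        for h in range(k):
            sums[label - 1][h] += row[h]
    M = [[s / r for s in srow] if r else None for srow, r in zip(sums, sizes)]
    return M, sizes

M, r = centroids([[0, 0], [2, 0], [4, 4]], [1, 1, 2], c=2)
```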

A visualization of X is a function $f : \{x_{1\cdot}, \ldots, x_{n\cdot}\} \to \mathcal{Z} \subset \mathbb{R}^{n \times m}$, $m < k$,
which assigns an image $z^i = (z^i_1, \ldots, z^i_m)' = f(x_{i\cdot})$ to each row of X. $\mathcal{Z}$ is
called the image space.

With $Z = (z^i_j)_{1 \le i \le n,\ 1 \le j \le m}$ the visualization f may be written as $f(X) = Z$. In
the following we only consider the case of m = 2.

2.2 Basic idea

Our approach to visualizing high-dimensional data in a plane is based on the
idea of considering the plane as a topographical map. When the images are
visualized as the vertices of a rectangular grid, each object has eight direct
neighbors, one in each direction of the compass (taking NE, SE, SW and
NW into account, compare figure 1). We try to obtain topology preservation
by re-ordering the objects along each of these eight directions according to
the distances of their data vectors in the original space $\mathbb{R}^k$. Considering the
example of the vector pointing from $z^{20}$ to the west in figure 1, this means that
with $x_{i\cdot} = \ldots$ after re-ordering, i.e. interchanging the values of $x_{21\cdot}$ to
$x_{24\cdot}$, the relation $d(x_{20\cdot}, x_{21\cdot}) \le d(x_{20\cdot}, x_{22\cdot}) \le d(x_{20\cdot}, x_{23\cdot}) \le d(x_{20\cdot}, x_{24\cdot})$
holds.

The method EDAM visualizes by repeating this “star-shaped” re-ordering
step successively for all objects until either convergence or another stopping
criterion is reached. The following subsection gives a formal definition of the
method.



2.3 KMC/EDAM 

The classification of X into a set of c < n clusters, c given, by the method
KMC/EDAM is performed by a combination of a K-Means algorithm and
a hierarchical method. First g > c clusters are constructed by applying the
K-Means algorithm suggested by Forgy (see Anderberg (1973)). Then the
agglomerative hierarchical centroid method (see Kaufmann and Pape (1996))
is applied to these clusters. After (g - c) steps of this method the final
classification $\kappa$ of the n objects into c clusters is obtained.
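The second, agglomerative part of this stage can be sketched as follows. The sketch assumes that the g initial centroids and cluster sizes are already available (the K-Means stage is not shown), that centroid distances are Euclidean, and that a merged centroid is the size-weighted mean of its two parts; the function name is hypothetical.

```python
def merge_to_c(centroids, sizes, c):
    """Agglomerative centroid method: merge the two closest centroids until c remain.

    Returns the merged centroids, their sizes and, for each final cluster,
    the indices of the initial clusters it contains.
    """
    cents = [list(m) for m in centroids]
    sizes = list(sizes)
    members = [[i] for i in range(len(cents))]
    while len(cents) > c:
        # pair of clusters with the smallest squared Euclidean centroid distance
        a, b = min(((i, j) for i in range(len(cents)) for j in range(i + 1, len(cents))),
                   key=lambda ij: sum((u - v) ** 2
                                      for u, v in zip(cents[ij[0]], cents[ij[1]])))
        total = sizes[a] + sizes[b]
        cents[a] = [(sizes[a] * u + sizes[b] * v) / total
                    for u, v in zip(cents[a], cents[b])]
        sizes[a] = total
        members[a] += members[b]
        del cents[b], sizes[b], members[b]      # b > a, so index a stays valid
    return cents, sizes, members

cents, sizes, members = merge_to_c([[0, 0], [1, 0], [10, 0]], [1, 3, 2], c=2)
```

Each pass of the loop is one step of the hierarchical method, so exactly g - c merges are performed.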

In the next stage of KMC/EDAM the centroids $\{\mu_1, \ldots, \mu_c\}$ of $\kappa$ are
visualized. For this purpose the image space is first fixed to the points of
intersection of $b_1$ vertical and $b_2$ horizontal lines of a two-dimensional, equally
spaced grid, with $c = b_1 \cdot b_2$. By labelling the images by their integer Euclidean
coordinates and enumerating them by rows from the lower left corner, the image
space can be written as:

$$\mathcal{Z} = \{z^1, \ldots, z^c\} \quad \text{with} \quad z^i = (z^i_1, z^i_2)' = \ldots \qquad (2)$$

The problem of visualizing the centroids in $\mathcal{Z}$ by a visualization f is to find
a permutation $\pi$ of $\{1, \ldots, c\}$ such that $f(\mu_{\pi(i)}) = z^i$, $i = 1, \ldots, c$, preserves
topology as well as possible (with respect to a predefined criterion).

The main idea of our method is to consider$^1$ each centroid $\mu_{\pi_{t-1}(i)}$ as
a “reference point” for the centroids whose images are lying on the vectors
pointing from $z^i$ to each direction $D \in \{N, NE, \ldots, NW\}$, where $\pi_0$ is a
randomly chosen initial permutation. First, for each direction D, the indices
$j_q^D$, $q = 1, \ldots, n_D$, of these images are determined. Table 1 gives an overview
of how these indices are calculated for all directions.

$^1$ The consideration of one centroid defines one step denoted by index t; the index i
defining the actual centroid computes to $i = t - \lfloor (t-1)/c \rfloor \cdot c$, i.e. each time t exceeds
a multiple of c, i is switched back to 1.



Table 1. Calculation of indices

D     $j_q^D$               $n_D$
N     $i + q b_1$           $b_2 - z_2^i$
NE    $i + q b_1 + q$       $\min(n_N, n_E)$
E     $i + q$               $b_1 - z_1^i$
SE    $i - q b_1 + q$       $\min(n_S, n_E)$
S     $i - q b_1$           $z_2^i - 1$
SW    $i - q b_1 - q$       $\min(n_S, n_W)$
W     $i - q$               $z_1^i - 1$
NW    $i + q b_1 - q$       $\min(n_N, n_W)$
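The index calculation of Table 1 can be sketched in Python. Grid positions are numbered 1, ..., c by rows from the lower left corner with $b_1$ positions per row; the coordinate formulas for $z_1^i$ and $z_2^i$ used below follow from this enumeration but are an assumption, since they are not spelled out here. The function name is hypothetical.

```python
def direction_indices(i, b1, b2):
    """Grid positions (1-based, row-wise from the lower left) on the eight rays from i."""
    z1 = (i - 1) % b1 + 1           # column coordinate of position i (assumed form)
    z2 = (i - 1) // b1 + 1          # row coordinate of position i (assumed form)
    n = {"N": b2 - z2, "E": b1 - z1, "S": z2 - 1, "W": z1 - 1}
    n.update(NE=min(n["N"], n["E"]), SE=min(n["S"], n["E"]),
             SW=min(n["S"], n["W"]), NW=min(n["N"], n["W"]))
    # index increment per step q in each direction
    step = {"N": b1, "NE": b1 + 1, "E": 1, "SE": -b1 + 1,
            "S": -b1, "SW": -b1 - 1, "W": -1, "NW": b1 - 1}
    return {D: [i + q * step[D] for q in range(1, n[D] + 1)] for D in step}

rays = direction_indices(13, b1=5, b2=7)    # position 13 has coordinates (3, 3)
```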



Let now $\pi^D$ be the permutation of … so that

$$\ldots$$

for each direction D. Now, set $\pi_t := \pi_{t-1}$. Next, the following substeps are
repeated for all directions D:

1. $\pi_t^D := \pi_t \ldots$

2. $\pi_t := \pi_t^D$ if $S(\pi_t^D) < S(\pi_t)$, and $\pi_t$ otherwise.

The function S is a predefined criterion for visualizations, with lower values
indicating better visualizations. Repeating the described procedure for all
centroids, i.e. a set of c steps, builds one iteration. In our investigations we
choose S as the STRESS known from MDS (see Hamerle and Pape (1996, p.
769)).

Each time no more improvement can be obtained after a complete
iteration (or alternatively if a given maximum number of iterations is
reached), the area in which re-ordering is possible is decreased by changing
the values of $n_D$ in table 1 to $\min(n_D, \max[b_1, b_2] - r)$, where r runs
successively from 1 to $\max[b_1, b_2] - 2$. A set of iterations with the same value of r
is called an iteration cycle.

The final visualization result $f(\mu_{\pi(i)}) = z^i$, $i = 1, \ldots, c$, of KMC/EDAM is
obtained by setting $\pi := \pi_t$, where t is the number of the last step.

3 Examples

First the introduced method is applied to the synthetic Chainlink data, which
consist of two three-dimensional interlocking ring-shaped classes as seen in
figure 2. In our example each class contains 1000 data points.

On the right side of figure 2 the two-dimensional result of the method
MDS for this example is depicted. The STRESS of this result is 0.246. Note
that in the original space the two classes have exactly the same relation to
each other, i.e. they have the same shape, the rings have the same radius
and the center of each ring lies on the other one. But looking at the MDS
visualization one gets the impression that there are differences between the
shapes of the classes.

Fig. 2. The Chainlink data and their MDS visualization
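A STRESS value for a given configuration can be computed as sketched below. The exact STRESS variant used for the reported values is not specified in this section; the sketch assumes a Kruskal-type STRESS-1 in which the original distances are first rescaled by a least-squares factor, and the function name is hypothetical.

```python
import math
from itertools import combinations

def stress1(X, Z):
    """STRESS-1 between original data rows X and image rows Z (assumed variant).

    The original distances are multiplied by the least-squares factor b before
    being compared with the distances in the image space.
    """
    pairs = list(combinations(range(len(X)), 2))
    delta = [math.dist(X[i], X[j]) for i, j in pairs]    # original distances
    d = [math.dist(Z[i], Z[j]) for i, j in pairs]        # image distances
    b = sum(u * v for u, v in zip(delta, d)) / sum(u * u for u in delta)
    num = sum((b * u - v) ** 2 for u, v in zip(delta, d))
    return math.sqrt(num / sum(v * v for v in d))

X = [[0, 0], [3, 4], [6, 8]]
good = stress1(X, [[0], [1], [2]])     # order preserved
bad = stress1(X, [[0], [2], [1]])      # two images interchanged
```

Lower values indicate that the image distances reproduce the original distances more closely.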

For the computation of KMC/EDAM for the Chainlink data the following
settings were used: g = 750, c = 500, $b_1$ = 20, $b_2$ = 25, maximum number of
iterations per cycle: 10. The result is shown in U-matrix representation on
the left side of figure 3. The U-matrix is a well-known tool developed for the
representation of Self-Organizing Maps (compare Ultsch (2003)). Since the
image space of KMC/EDAM is restricted to a rectangular grid, the U-matrix
can easily be applied to the results of this method as well. For comparison
purposes the right side of figure 3 shows the U-matrix of a SOM of the same
size applied to the same data. For the computation of the SOM the package
som available for the statistical software R (R Development Core Team (2004))
with its default settings was used. The size of the symbols in both pictures
corresponds to the number of objects assigned to each cluster.

Fig. 3. KMC/EDAM and SOM visualization of the Chainlink data

The STRESS of the KMC/EDAM solution is 0.209, that of SOM is 0.252,
so KMC/EDAM seems to be better. Beyond this superiority of KMC/EDAM
over MDS and SOM, the result of KMC/EDAM gives an evidently better
reflection of the fact that the Chainlink classes are placed equally relative to
each other. This is not the case for SOM, since the class depicted by circles
seems to surround parts of the class depicted by triangles. At first glance
the separation of the classes seems better for MDS and SOM, since the
latter leave gaps between the classes. But in the U-matrix of the KMC/EDAM
result a dark line is visible which corresponds to relatively high distances
between objects along the line. This line runs between the two classes like
a boundary. The brightness of the rest of the map is well-adjusted, which
suggests that the topology within the classes is well preserved. Another
advantage of the KMC/EDAM result compared to that of SOM is that it
maintains the connection of the classes, i.e. there are no exclaves. In the SOM
result there are apparently a few objects of the “triangle class” separated
from the rest by the “circle class”.

The next example we consider is the well-known iris data set introduced 
by Fisher (1936), which contains sepal and petal lengths and widths of 150 
flowers from three species of iris. Figure 4 shows a plot of the MDS result and the 
U-matrices of KMC/EDAM and SOM results for this example. The settings 
of KMC/EDAM were: <7=50, c=35,6i = 5, 62 = 7, maximum number of 
iterations per cycle: 10. 





Fig. 4. MDS, KMC/EDAM and SOM visualizations of the iris data 



The STRESS of KMC/EDAM is 0.351 in this case, while that of SOM 
is 0.252. MDS performs even better with a STRESS of 0.04. As in 
the previous example, SOM leaves a gap between the well-separated classes 
versicolor and setosa. Again this separation is visible as a dark line in the 
U-matrix of the KMC/EDAM result. The separation of the classes virginica 
and versicolor seems slightly better in the SOM result, since the squares 
between these classes are darker there than in other regions of the map. 

4 Conclusion 

With KMC/EDAM a method is introduced which allows the results of 
classical clustering methods to be visualized. Because of its specially chosen target 
space the results of this method are directly comparable to those of SOM. 
The method is applied to two popular examples, the artificial Chainlink data 
and Fisher’s iris data. In the critical Chainlink example KMC/EDAM leads 
to better results than MDS and SOM. 

In the iris example KMC/EDAM has the highest STRESS, but the relative 
positions of the classes are the same with KMC/EDAM as with the other 
methods. Furthermore, the lack of separation, which is probably the reason 
for the higher STRESS, also becomes visible when the result is represented 
in a U-matrix. 

Modifications for the improvement of EDAM are conceivable. Such modi- 
fications may concern the optimization of the initial ordering of the centroids. 
On the other hand a method like Simulated Annealing could be integrated 
into the algorithm to avoid local optima. First attempts in this direction led 
to promising results. 

Abstract. Clustering of variables around latent components is a means of 
organizing multivariate data into meaningful subgroups. We extend the approach to 
situations with missing data. A straightforward method is to replace the missing 
values by some estimates and cluster the completed data set. This basic imputation 
method is improved by more sophisticated procedures which update the imputa- 
tions within each group after an initial clustering of the variables. We compare 
the performance of the different imputation methods with the help of a simulation 
study. 



1 Introduction 

The problem of missing values occurs very often in practice. In this paper 
we propose methods to deal with the problem of missing values when we 
want to cluster variables. A method of clustering variables around latent 
components (CAVALC) was proposed by Vigneau and Qannari (2003). This 
procedure bears some similarity to VARCLUS which is implemented in SAS 
(SAS/STAT (1990)). However, it is based on a simple algorithm and may be 
extended in various ways as discussed by Vigneau and Qannari (2002). This 
method is briefly described in section 2. 

An important application of this technique is given by the clustering of 
consumers who give their scores of preference to different products. Usually 
these preference scores are analysed by means of a preference mapping tech- 
nique which mainly consists in performing a principal components analysis 
on the data set whose rows are the products and columns are the scores given 
by the various consumers (Greenhoff and MacFie (1994)). 

However, it is not always possible to present all the products to each con- 
sumer, especially when we have saturating products such as beers. Therefore 
each consumer evaluates a subset of the products. The resulting data set is 
incomplete. 

In section 3 we propose some imputation methods for the clustering in 
this situation. The real data set under study is briefly described in section 
4. We compare these imputation methods on the basis of a simulation study 
(section 5). 

2 Clustering of variables around latent components 

We consider p variables $x_1, x_2, \ldots, x_p$ measured on a sample of n 
observations. The clustering procedure discussed herein consists in 
representing each cluster of variables by a latent component. 

More precisely, the strategy consists in simultaneously determining K 
clusters of variables (K is supposed to be fixed) and K latent components 
$c_1, c_2, \ldots, c_K$ such that 

$$S = \sqrt{n} \sum_{k=1}^{K} \sum_{j=1}^{p} \delta_{jk} \, \mathrm{cov}(x_j, c_k)$$

is maximized under the constraint $c_k' c_k = 1$. In this criterion S, the 
parameter $\delta_{jk}$ is equal to 1 if variable $x_j$ belongs to cluster k 
and 0 otherwise, and cov stands for the covariance. 

S is maximized by a partitioning algorithm in the course of which latent 
components and group memberships are iteratively updated. The initializa- 
tion of the partitioning algorithm is based on an agglomerative hierarchical 
procedure (Vigneau and Qannari (2003)). 

For a given cluster $G_k$, the latent variable $c_k$ is collinear with the 
mean $\bar{x}_k$ of the variables belonging to cluster $G_k$. 
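
The alternating update described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the covariance-based criterion, so that each latent component is the normalized mean of the centered variables in its cluster, and it replaces the hierarchical initialization by a starting partition supplied by the caller. The function names, the list-of-columns data layout and the iteration cap are hypothetical choices made for the example.

```python
import math

def centered(col):
    """Center one variable (a list of n observations)."""
    m = sum(col) / len(col)
    return [v - m for v in col]

def latent_component(cols):
    """Normalized mean of the variables in one cluster: c = xbar / ||xbar||."""
    n = len(cols[0])
    xbar = [sum(c[i] for c in cols) / len(cols) for i in range(n)]
    norm = math.sqrt(sum(v * v for v in xbar))
    return [v / norm for v in xbar] if norm > 0 else xbar

def cov(x, c):
    """Covariance of a centered variable x with a (centered) component c."""
    return sum(a * b for a, b in zip(x, c)) / len(x)

def cavalc_partition(variables, labels, n_clusters, max_iter=50):
    """Alternate between updating components and reassigning variables."""
    X = [centered(v) for v in variables]
    labels = list(labels)
    for _ in range(max_iter):
        comps = []
        for k in range(n_clusters):
            members = [x for x, g in zip(X, labels) if g == k]
            comps.append(latent_component(members) if members else None)
        new = []
        for x in X:
            # An empty cluster cannot attract variables in this sketch.
            scores = [cov(x, c) if c is not None else float("-inf")
                      for c in comps]
            new.append(max(range(n_clusters), key=scores.__getitem__))
        if new == labels:      # partition is stable: stop
            break
        labels = new
    return labels

# Four toy variables: the first two rise together, the last two fall together.
toy = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 11], [5, 3, 4, 1, 2], [10, 7, 8, 3, 3]]
print(cavalc_partition(toy, [0, 1, 0, 1], 2))   # -> [1, 1, 0, 0]
```

Starting from a deliberately mixed partition, the sketch regroups the two increasing variables and the two decreasing ones after a single reassignment step.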

3 Imputation methods 

3.1 Direct imputation methods 

An intuitive method for dealing with missing values consists in replacing 
each missing value by the mean of the observed values for the variable under 
consideration. We will refer to this method as the vertical imputation. 

In the special case of preference data it is also possible to replace each 
missing value by the mean of the observed values for the observation (prod- 
uct) under consideration. In that way, the mean score observed on the whole 
panel is attributed to all missing data for a product. We will refer to this 
imputation method as the horizontal imputation. 

After replacing the missing values by these estimates, we cluster the com- 
pleted data set. 
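
The two direct imputations can be sketched in a few lines. The layout is an assumption made for the example: rows are observations (products), columns are variables (consumers), and `None` marks a missing value. The function names are hypothetical, and the sketch assumes that every row and every column contains at least one observed value.

```python
def vertical_imputation(data):
    """Replace each missing value by the mean of its column (variable)."""
    n_rows, n_cols = len(data), len(data[0])
    out = [row[:] for row in data]
    for j in range(n_cols):
        observed = [data[i][j] for i in range(n_rows) if data[i][j] is not None]
        col_mean = sum(observed) / len(observed)
        for i in range(n_rows):
            if out[i][j] is None:
                out[i][j] = col_mean
    return out

def horizontal_imputation(data):
    """Replace each missing value by the mean of its row (product)."""
    out = []
    for row in data:
        observed = [v for v in row if v is not None]
        row_mean = sum(observed) / len(observed)
        out.append([row_mean if v is None else v for v in row])
    return out

# Three products scored by three consumers, one missing score each.
scores = [[5.0, 1.0, None],
          [8.0, None, 6.0],
          [None, 3.0, 4.0]]
print(vertical_imputation(scores)[0][2])    # -> 5.0 (mean of column 3)
print(horizontal_imputation(scores)[0][2])  # -> 3.0 (mean of row 1)
```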

3.2 Imputation within each cluster 

We can improve the results by updating the imputations after clustering. 
Each missing value is replaced by an estimate that is based on the values 
observed on the variables of the same group. 

In preference studies the consumers use the given scale in different 
ways. More precisely, consumers may score at different levels of the scale, 
as will be illustrated in section 4, or may differ in the range of scores 
they use. To make the variables comparable, it is necessary to standardize 
them first: 

$$z_{ij} = \frac{x_{ij} - \hat{\mu}_j}{\hat{\sigma}_j}$$

The required estimates $\hat{\mu}_j$ and $\hat{\sigma}_j$ of the mean and 
standard deviation are calculated using the observed values of variable j. 
In each group $G_k$, the mean $m_{ik}$ of each observation i (i = 1, ..., n) 
is calculated using the standardized observed values. 

If the value of observation i for variable j in group $G_k$ is missing, 
$\hat{z}_{ij} = m_{ik}$ is used as an estimate. Finally, the imputed data are 
’destandardized’ as follows: $\hat{x}_{ij} = \hat{z}_{ij} \hat{\sigma}_j + \hat{\mu}_j$. 
The calculation of the latent components is then updated using the observed 
values and the new imputations $\hat{x}_{ij}$ of the unobserved values. 
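
A minimal sketch of this update step is given below. It assumes the same layout as before (rows are observations, columns are variables, `None` marks a missing value) and a list `labels` giving the group of each variable. The function name is hypothetical. Two details are assumptions of the sketch: the standard deviation uses the divisor n, and when a group has no observed value for an observation, the variable mean is used instead.

```python
import math

def impute_within_clusters(data, labels):
    """Return a completed copy of data; labels[j] is the group of variable j."""
    n, p = len(data), len(data[0])
    mu, sigma = [], []
    for j in range(p):
        obs = [data[i][j] for i in range(n) if data[i][j] is not None]
        m = sum(obs) / len(obs)
        s = math.sqrt(sum((v - m) ** 2 for v in obs) / len(obs))
        mu.append(m)
        sigma.append(s if s > 0 else 1.0)   # guard against constant variables
    out = [row[:] for row in data]
    for i in range(n):
        for j in range(p):
            if data[i][j] is not None:
                continue
            # Standardized observed values of observation i in the same group.
            peers = [(data[i][l] - mu[l]) / sigma[l]
                     for l in range(p)
                     if labels[l] == labels[j] and data[i][l] is not None]
            # No data in the group for this row: z = 0, i.e. the variable mean.
            z_hat = sum(peers) / len(peers) if peers else 0.0
            out[i][j] = z_hat * sigma[j] + mu[j]   # destandardize
    return out

data = [[1.0, 10.0], [2.0, 20.0], [3.0, None], [4.0, 40.0]]
completed = impute_within_clusters(data, [0, 0])
```

With both variables in the same group, the missing entry is estimated from the standardized score of the first variable in that row and lands above the plain column mean of the second variable.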



3.3 Method based on a cross-partition 

This method is based on the two different initial imputations outlined in section 
3.1 (horizontal and vertical imputations). Generally, the clustering of the two 
completed data sets provides two different partitions into K groups. The 
analysis of their cross-partition may improve the grouping. We proceed as 
follows (Sahmer (2003)): 

We calculate the cross-partition, which consists of $K^2$ groups, called 
’stable groups’ (’groupements stables’, Lebart et al. (2000)). As an illustration, 
consider the clustering of six variables which we denote 1, 2, 3, 4, 5, 6. Sup- 
pose that the first clustering leads to the partition (1, 2, 3) and (4, 5, 6) and 
the second clustering leads to the partition (1, 2, 6) and (3, 4, 5). The four 
stable groups are obtained by considering the intersections of each group from 
the first partition with each group of the second partition. This leads to the 
partition (1, 2), (3), (6) and (4, 5). The imputations are then updated in the 
stable groups according to the procedure described in section 3.2. However, it 
should be noted that the stable groups may contain only very few variables. 
So it is possible that in group Gk there is no data for an observation i. In this 
case, the mean of the observed values of the variable (vertical imputation) is 
used. 
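
The six-variable illustration above can be reproduced with a short sketch. The function name `stable_groups` and the encoding of a partition as a list of group labels are choices made for the example. Ties between equally large groups are resolved here by order of appearance rather than at random.

```python
from collections import defaultdict

def stable_groups(labels_a, labels_b):
    """Group variables by their pair of labels in the two partitions."""
    groups = defaultdict(list)
    for var, pair in enumerate(zip(labels_a, labels_b), start=1):
        groups[pair].append(var)
    return sorted(groups.values())

first = [0, 0, 0, 1, 1, 1]    # partition (1, 2, 3) and (4, 5, 6)
second = [0, 0, 1, 1, 1, 0]   # partition (1, 2, 6) and (3, 4, 5)

groups = stable_groups(first, second)
print(groups)                 # -> [[1, 2], [3], [4, 5], [6]]

# The K = 2 largest stable groups seed the final clustering.
largest = sorted(groups, key=len, reverse=True)[:2]
print(largest)                # -> [[1, 2], [4, 5]]
```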

The K stable groups with the largest numbers of variables are determined. 
In the case of ties, we may randomly select which of the tied groups to retain. 

The latent components $c_k = \bar{x}_k / \sqrt{\bar{x}_k' \bar{x}_k}$ of these 
K largest stable groups are calculated. The covariances of all the variables 
not belonging to these K groups with each of the K latent components are 
determined. Each of these variables is assigned to the group with whose 
latent component it has the largest covariance. 

Finally, in each of the K groups, the imputations and the latent 
component are again updated as described in section 3.2. 



4 Illustration: data set ’jam’ 



The methods are compared on the basis of a real data set from a preference 
study. These data were collected by students at ENITIAA (Nantes, France) 
during a training period. These students manufactured seven varieties of jam 
with different percentages of apple and pear and added vanilla or cinnamon 
flavour. Hedonic ratings were given by a panel of consumers. In addition, the 
sensory properties of the jams were evaluated. Therefore, we have three kinds 
of variables concerning the jams: the compositional data, sensory variables 
and the hedonic scores. 

In this paper we focus on the analysis of the hedonic data. 56 consumers 
gave their scores of preference. They scored each product on an unstructured 
10 cm scale according to their liking. 



Table 1. Preference data of two consumers 

                      Observed data                Standardized data
                      Consumer 41   Consumer 53    Consumer 41   Consumer 53
Jam 1                     5.0           1.2           -0.9          -0.2
Jam 2                     8.3           3.6            1.6           1.9
Jam 3                     6.0           0.4           -0.1          -0.9
Jam 4                     4.0           0             -1.7          -1.2
Jam 5                     7.0           2.4            0.6           0.9
Jam 6                     7.0           1.1            0.6          -0.3
Jam 7                     6.0           1.2           -0.1          -0.2
Average                   6.2           1.4            0             0
Standard deviation        1.3           1.1            1             1


As an illustration, the left side of Table 1 shows the scores given by 
consumers 41 and 53 to the seven varieties of jam. This example shows how 
differently the consumers use the scale. Consumer 41 clearly gives higher 
scores than consumer 53. Nevertheless, their patterns of liking are very 
similar, as reflected by the standardized data. 
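
The standardized columns of Table 1 can be recomputed from the observed scores. The sketch below uses the divisor n for the standard deviation. This is an assumption, chosen because it reproduces the rounded entries of the table. The function name `standardize` is hypothetical.

```python
import math

def standardize(scores):
    """Center and scale one consumer's scores (divisor n for the variance)."""
    m = sum(scores) / len(scores)
    s = math.sqrt(sum((v - m) ** 2 for v in scores) / len(scores))
    return [(v - m) / s for v in scores]

consumer_41 = [5.0, 8.3, 6.0, 4.0, 7.0, 7.0, 6.0]
consumer_53 = [1.2, 3.6, 0.4, 0.0, 2.4, 1.1, 1.2]

z41 = [round(z, 1) for z in standardize(consumer_41)]
z53 = [round(z, 1) for z in standardize(consumer_53)]
print(z41)   # -> [-0.9, 1.6, -0.1, -1.7, 0.6, 0.6, -0.1]
print(z53)   # -> [-0.2, 1.9, -0.9, -1.2, 0.9, -0.3, -0.2]
```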

We used this data set together with simulated data in order to compare 
the different imputation methods. 

5 Simulation study 

5.1 Jam data set 

We clustered the complete data set and determined two groups of consumers. 
In a subsequent stage, we simulated one to four missing values per consumer. 
We clustered the incomplete data sets with each of the imputation methods 
and compared the results to the clusters obtained from the complete data set. 
For each number of missing values, we repeated this procedure 100 times. 
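
One way to simulate a fixed number of missing values per consumer is sketched below. The function name `knock_out`, the use of `None` as the missing-value marker and the seeded random generator are assumptions of the example. The original study does not state how the positions were drawn, so uniform sampling without replacement within each column is used here.

```python
import random

def knock_out(data, n_missing, rng):
    """Set n_missing randomly chosen entries of every column to None."""
    n_rows, n_cols = len(data), len(data[0])
    out = [row[:] for row in data]
    for j in range(n_cols):
        for i in rng.sample(range(n_rows), n_missing):
            out[i][j] = None
    return out

rng = random.Random(0)   # fixed seed so that a run can be repeated
# Placeholder scores: 7 products (rows) and 4 consumers (columns).
complete = [[float(i + j) for j in range(4)] for i in range(7)]
incomplete = knock_out(complete, 2, rng)
```

Repeating the call with fresh draws, for example 100 times per number of missing values, gives the replicated incomplete data sets to which the imputation methods are applied.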

5.2 Simulated data 

We simulated data that were designed to reflect the different ways consumers 
use the given scale. For each group k, we simulated an average score $m_{ik}$ 
for each product i. The score of each consumer j is based on this average. 
More precisely, the score of a consumer j in cluster k is simulated by 
multiplying $m_{ik}$ by a scaling factor $d_{jk}$ which reflects the different 
ranges of the scale used by the consumers. Thereafter, a translation $t_{jk}$ 
and a normally distributed noise $\varepsilon_{ijk}$ are added. The score of 
consumer j in group k for product i is given by (Callier (1996)): 

$$x_{ijk} = t_{jk} + \bar{m}_k + d_{jk}(m_{ik} - \bar{m}_k) + d_{jk}\varepsilon_{ijk}, \qquad \varepsilon_{ijk} \sim N(0, \sigma^2).$$

We simulated data sets with two and three groups. The standard deviation 
of the noise was set to $\sigma = 0.5$ and to $\sigma = 2$. For each 
combination of these two parameters we simulated 100 data sets with eight 
products and 56 consumers each. Then, in each data set we simulated one to 
five missing values per consumer. We clustered the incomplete data sets with 
the different imputation methods. We compared the resulting groups to the 
simulated groups. 



5.3 Criterion for comparison 

The comparison of the methods was based on different criteria. We show 
herein the results regarding the criterion ACALC (Average of the Corre- 
lations of the Associated Latent Components). It was proposed by Cal- 
lier (1996) under the acronym of MCGA (Moyenne des Correlations entre 
Groupes Associes). This criterion indicates the extent to which the latent 
components of the incomplete data are related to those of the complete data. 
However, the labelling of the groups is arbitrary. Group 1 of the complete 
data set does not necessarily correspond to group 1 of the incomplete data 
set, etc. Consequently, we first have to determine the association between the 
groups and the latent components in the two partitions (from the incomplete 
and complete data sets). In practice, we consider each possible combination 
and calculate the average correlation between associated latent components 
for each of them. The criterion ACALC corresponds to the maximum average 
correlation. 

The calculation is illustrated for the case of K = 2 as follows. The average 
correlations of the possible combinations are given by 

$$\mathrm{mcor}_1 = \tfrac{1}{2}\left(\mathrm{cor}(c_1, \hat{c}_1) + \mathrm{cor}(c_2, \hat{c}_2)\right)$$

and 

$$\mathrm{mcor}_2 = \tfrac{1}{2}\left(\mathrm{cor}(c_1, \hat{c}_2) + \mathrm{cor}(c_2, \hat{c}_1)\right),$$

where $c_k$ are the latent components of the complete data and $\hat{c}_k$ 
the latent components of the incomplete data. 

We define 

$$\mathrm{ACALC} = \max(\mathrm{mcor}_1, \mathrm{mcor}_2).$$

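
For general K the criterion can be computed by enumerating all associations between the two sets of latent components. The sketch below does this with permutations, which is feasible for the small K considered here. The function names `cor` and `acalc` and the representation of a component as a list of numbers are choices made for the example.

```python
from itertools import permutations
import math

def cor(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def acalc(complete, incomplete):
    """Maximum, over all group associations, of the mean correlation."""
    K = len(complete)
    return max(
        sum(cor(complete[k], incomplete[perm[k]]) for k in range(K)) / K
        for perm in permutations(range(K))
    )

c1 = [1.0, 2.0, 3.0, 4.0]
c2 = [4.0, 1.0, 3.0, 2.0]
# Same components, rescaled and listed in the opposite order.
relabelled = [[2 * v for v in c2], [v + 1 for v in c1]]
value = acalc([c1, c2], relabelled)   # equals 1 up to rounding error
```

Because the maximum is taken over all associations, the arbitrary labelling of the groups does not affect the result.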



5.4 Results 

It was observed that the two direct-imputation methods were almost always 
improved by an update of the imputations after clustering. 

In the following, we consider the vertical imputation with an update of the 
imputations, the horizontal imputation with an update of the imputations 
and the method based on a cross-partition. Their results are shown in Figure 
1. For the simulated data sets we only give the results for two groups and 
$\sigma = 0.5$ and for three groups and $\sigma = 2$. The results for the 
other combinations of parameters lie between these two extremes. 

The figure shows the average of the criterion ACALC over the simulations 
as a function of the number of missing values per consumer. As can be 
expected, the quality of the clustering decreases when the number of missing 
values increases. 

For the simulated data with two groups and $\sigma = 0.5$ the three methods 
are almost equivalent. In this situation (small noise) we obtain good results 
when up to five out of eight values (60 %) are missing. 

For the case of three groups and $\sigma = 2$ and also for the data ’jam’, the 
method based on a cross-partition and the horizontal imputation with an 
update of the imputations perform best. We observe a sharp decrease in 
criterion ACALC when the number of missing values increases. If we set 
a limit of 0.9 on the average value of criterion ACALC, it turns out 
that the two best methods give good results when up to three out of eight 
values (40 %) are missing (simulated data with three groups and $\sigma = 2$) 
or when up to two out of seven values (30 %) are missing (data ’jam’). 

However, the results of the different simulations showed great variations. 
Therefore, it is interesting to consider not only the average value of ACALC 
but its complete distribution. Boxplots of the values of criterion ACALC for 
the data with three groups and $\sigma = 2$ are given in Figure 2, for three 
missing values per consumer. It appears that the method based on a 
cross-partition safeguards against very poor results. 

Fig. 1. Mean of criterion ACALC for the vertical imputation with an update (’vert 
update’), horizontal imputation with an update (’hor update’) and the method 
based on a cross-partition (’cross’) 

Fig. 2. Boxplots for three groups, $\sigma = 2$ and three missing values 

6 Conclusion 

We have compared several imputation methods with regard to their 
performance within a clustering of variables framework. Two methods performed 
very well in the context of preference studies. The first one replaces 
each missing value by the mean of the observed values for the observation un- 
der consideration and then clusters the completed data set. After clustering, 
the imputations are updated within each cluster. The second one is based on 
the clustering of two completed data sets and the cross-partition of the two 
partitions thus obtained. 

In an ideal situation with small noise these methods perform very well 
even when more than half of the values are missing. In more realistic situa- 
tions, they can perform well if less than one third of the values are missing. 

Abstract. We present a method for on-line classification of triggered but 
temporally blurred events that are embedded in noisy time series. This means that the 
time point at which an event is initiated or a dynamical system is perturbed is 
known, e.g., the moment an injection of a therapeutic agent is given to a patient. 
From the ongoing monitoring of the system one has to derive a classification of the 
event or the induced change of the state of the system, e.g., whether the state of 
health improves or degrades. For simplification we assume that the reactions form 
two classes of interest. In particular the goal of the binary classification problem is 
to obtain the decision on-line, as fast and as reliably as possible. 

To provide a probabilistic decision at every time-point t the presented method 
gathers information across time by incorporating decisions from prior time-points 
using an appropriate weighting scheme. For this specific weighting we utilize the 
Bayes error to gain insight into the discriminative power between the instantaneous 
class distributions. 

The effectiveness of this procedure is verified by its successful application in the 
context of a Brain Computer Interface, especially to the binary discrimination task 
of left against right imaginary hand-movements from ongoing raw EEG data. 



1 General framework 

In this paper we present an approach to improving the on-line classification 
of sequential data by combining information across time. The proposed 
method rests on the following assumptions. First, we consider only binary 
classification problems, which can directly be interpreted as the detection 
of two distinguishable states of a dynamical system embedded in a 
high-dimensional noisy environment. Second, we assume that the event onset, 
or the time at which the system is perturbed, is known. However, the 
development of the event or the change of the system's state might be fuzzy, 
in the sense that the relevant information is blurred or spread over time. 
Problems of this kind often occur in biomedical investigations, e.g., when 
monitoring the state of health of a patient after an injection (Morik et al. 
(2000)). This is also a commonly used framework in the more general context 
of control and feedback control systems. 

Fig. 1. Single epoch containing a pre-event time interval and the event, embedded 
in a high-dimensional time series. The task is to classify on-line the change of the 
system during the time window of interest after the event onset (gray area). 



The task is to provide an on-line classification of the system's state at every 
time point t in a time window of interest after the event onset. The decision 
must be derived from the ongoing observations of the high-dimensional noisy 
process. 



1.1 Data format 

Relative to the given event onsets we cut the time series into epochs, where 
each epoch contains a single event in the corresponding time window of 
interest. Additionally, the epochs can include data from pre-event time 
intervals that can be used for calibration or baseline correction (see Fig. 1). 

Let $x_1, \ldots, x_N \in \mathcal{X}$ denote the training sample, where 
$x_k(t)$ is the high-dimensional observation of the k-th epoch, 
$k = 1, \ldots, N$, and $t = T_a, \ldots, T_e$ the time index. The event onset 
takes place at $t_0$, $T_a < t_0 < T_e$. The corresponding class labels 
$y_1, \ldots, y_N \in \mathcal{Y} = \{-1, 1\}$ of the training sample are 
given. The on-line classification can now be formulated as a collection of 
mappings $f_t : \mathcal{X} \to \mathcal{Y}$. These functions $f_t$ can be 
estimated on the basis of 
$Z_t = \{(x_k(t_i), y_k),\ k = 1, \ldots, N,\ t_0 \le t_i \le t\}$. Utilizing 
these estimated functions, the on-line classification of unlabeled epochs can 
be derived. 

1.2 On-line classification 

Regardless of the classification algorithm used, one can distinguish between 
two opposite on-line classification approaches: ‘instantaneous’ and ‘batch’ 
classification. Instantaneous classification gives a decision $d_{t_i}$ for 
each time point $t_i$ based only on the observation $x_k(t_i)$ at this time. 
In contrast, batch classification derives the decision $D_{t_i}$ from all 
previous observations $\{x_k(t),\ t \le t_i\}$. Both approaches have 
advantages and drawbacks. The series of instantaneous classifications $d_t$ 
can be very unsteady, whereas the series of $D_t$ is more stable. However, 
this stability comes at the cost of an increased model complexity, and the 
amount of data used can become intractable due to memory restrictions. The 
complexity of the instantaneous classification problem is lower and stays 
constant over time. 

Similar to batch classification, one can also combine all preceding 
instantaneous decisions $d_t$, $t \le t_i$, into a single decision $D_{t_i}$. 
This combination can be done in several ways. One can use: 

1) expectation: $D_{t_i} = \sum_{t=t_0}^{t_i} d_t$, 

2) majority vote: $D_{t_i} = \arg\max_{y \in \mathcal{Y}} \#\{t : d_t = y\}$, 

3) product: $D_{t_i} = \prod_{t=t_0}^{t_i} d_t$. 

Applying any one of these combination methods, one implicitly assumes that 
each time point t contains the same ‘amount’ of information. In many 
practical applications this assumption is questionable. Furthermore, by using 
the product-based combination scheme one assumes independence of the 
decisions, which does not hold in general for the class of problems we are 
addressing. 

Instead of using these combination schemes, we suggest deriving a decision 
$D_{t_i}$ at each time point $t_i$ as a weighted combination of all previous 
instantaneous decisions $d_t$, $t \le t_i$, where the weights are chosen 
proportional to an estimate of the amount of discriminative information that 
can be derived from the specific time point. 

The paper is organized as follows: The introduction of the method in 
section 2 is followed by an experimental section, in which we present the 
results of a case study illustrating the benefit of the proposed approach. 

2 Integration of information across time 

In order to derive the on-line classification at a certain time $t_i$, we 
incorporate knowledge from all preceding time points $t_0 \le t \le t_i$, 
leading to an accumulation of evidence over time about the binary decision 
process $d_t$. The temporal combination is realized by taking the expectation 
of the class probability with respect to the discriminative power of each 
time instance. 

More formally, a decision at time $t_i$ is given through a weighted linear 
combination of the previous decision process $d_t$: 

$$D_{t_i} = \sum_{t=t_0}^{t_i} g_t \, d_t. \qquad (1)$$

The weights $g_t$ represent the discriminative power. The question arises how 
to measure this discriminative power. In our situation we estimate the two 
class distributions $P(x(t) \mid y)$, $y \in \{-1, 1\}$, on the training data. 
Applying Bayes' decision rule, one decides for $y = 1$ if 
$P(y = 1 \mid x(t)) > P(y = -1 \mid x(t))$, otherwise for $y = -1$. The Bayes 
error of this decision rule is given by 
$P(\mathrm{error} \mid x(t)) = \min[P(y = 1 \mid x(t)), P(y = -1 \mid x(t))]$. 
A small Bayes error indicates a good separability of the two class 
distributions, whereas a high Bayes error means that the data contain less 
discriminative information. Consequently, we use the Bayes error as a measure 
of the discriminative power at time point t. Thus the weights are defined as 

$$g_t = 0.5 - P(\mathrm{error} \mid x(t)). \qquad (2)$$
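
The weighting of eqs. (1) and (2) can be sketched as follows. The sketch assumes instantaneous decisions coded as -1 or 1 and a list of Bayes errors per time point. The error values below are invented for illustration, and the function names are hypothetical. The sum is left unnormalized here, so only its sign and relative size should be interpreted.

```python
def weights_from_bayes_error(bayes_errors):
    """g_t = 0.5 - P(error | x(t)): zero weight at chance level."""
    return [0.5 - e for e in bayes_errors]

def combine(decisions, weights):
    """Running weighted sum D_{t_i} = sum over t <= t_i of g_t * d_t."""
    out, total = [], 0.0
    for d, g in zip(decisions, weights):
        total += g * d
        out.append(total)
    return out

d = [1, -1, -1, 1, 1]                # instantaneous decisions
err = [0.5, 0.45, 0.4, 0.1, 0.2]     # hypothetical Bayes errors per time point
D = combine(d, weights_from_bayes_error(err))
```

In this toy sequence the early decisions carry almost no weight, so the two confident decisions at the end outweigh the two early votes for the other class. An unweighted sum of the same decisions would also end positive, but it would give the uninformative first time point the same influence as the informative ones.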

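Since the estimation of the Bayes error is usually intractable, we exploit the 
Chernoff bound (Duda et al. (2001)), which is an upper bound on the Bayes 
error. The Chernoff bound is defined as the minimum over all 
$\beta \in [0, 1]$ of the right-hand side of the following inequality: 

$$P(\mathrm{error}) \le P(y = 1)^{\beta} P(y = -1)^{1-\beta} \int p(x \mid y = 1)^{\beta} \, p(x \mid y = -1)^{1-\beta} \, dx. \qquad (3)$$

Notice that, if $p(x \mid y = 1)$ and $p(x \mid y = -1)$ are normal, the 
Chernoff bound can be evaluated analytically by finding the $\beta$ that 
minimizes 

$$\int p(x \mid y = 1)^{\beta} \, p(x \mid y = -1)^{1-\beta} \, dx = e^{-k(\beta)}, \qquad (4)$$

where, with $\mu_\pm$ and $\Sigma_\pm$ denoting the means and covariance 
matrices of the two classes, 

$$2k(\beta) = \beta(1-\beta)(\mu_+ - \mu_-)' [\beta\Sigma_- + (1-\beta)\Sigma_+]^{-1} (\mu_+ - \mu_-) + \ln \frac{|\beta\Sigma_- + (1-\beta)\Sigma_+|}{|\Sigma_-|^{\beta} \, |\Sigma_+|^{1-\beta}}.$$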
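
The Gaussian case can be sketched numerically. To stay within the Python standard library, the example below is restricted to diagonal covariance matrices, which is a simplification and not the setting of the paper. For diagonal covariances the exponent $k(\beta)$ is a sum of one-dimensional terms. The minimization over $\beta$ is done by a simple grid search, the class priors are assumed to be equal, and all function names are hypothetical.

```python
import math

def k_beta(beta, mu_p, var_p, mu_m, var_m):
    """Exponent k(beta) for two Gaussians with diagonal covariances.

    mu_p, var_p: per-dimension means and variances of class y = +1
    mu_m, var_m: per-dimension means and variances of class y = -1
    """
    k = 0.0
    for mp, vp, mm, vm in zip(mu_p, var_p, mu_m, var_m):
        v = beta * vm + (1.0 - beta) * vp
        k += 0.5 * beta * (1.0 - beta) * (mp - mm) ** 2 / v
        k += 0.5 * math.log(v / (vm ** beta * vp ** (1.0 - beta)))
    return k

def chernoff_integral(mu_p, var_p, mu_m, var_m, steps=1000):
    """Minimum over beta in [0, 1] of exp(-k(beta)), by grid search."""
    return min(
        math.exp(-k_beta(b / steps, mu_p, var_p, mu_m, var_m))
        for b in range(steps + 1)
    )

def weight(mu_p, var_p, mu_m, var_m):
    """g = 0.5 - bound, where the bound is 0.5 * min integral (equal priors)."""
    return 0.5 * (1.0 - chernoff_integral(mu_p, var_p, mu_m, var_m))

g_same = weight([0.0], [1.0], [0.0], [1.0])    # identical classes
g_far = weight([0.0], [1.0], [10.0], [1.0])    # well-separated classes
```

Identical class distributions give a weight of zero, while well-separated classes give a weight close to the maximum of 0.5.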

3 Application 

As an application of the proposed method we choose the binary classification 
problem of distinguishing between left and right imagined hand movements 
from electroencephalogram (EEG) recordings. This is a common task in 
the newly emerging field of Brain Computer Interfaces (BCI) (Krausz et al. 
(2003), Dornhege et al. (2004)). 

The data used in this study are taken from the 2003 BCI competition 
(Blankertz et al. (2003)). In particular, we apply our method to data set III, 
“imagined hand movement”, provided by the Dept. of Med. Informatics, 
Inst. for Biomed. Eng. at the Univ. of Techn. Graz. The EEG from three 
channels (C3, Cz, C4) was acquired with band filter settings of 0.5 to 30 Hz 
and sampled at 128 Hz. The data consist of 140 labeled and 140 unlabeled 
trials of imaginary hand movements, with an equal number of left and right 
hand trials. Each trial has a duration of 9 s: after a 3 s preparation period a 
visual cue (arrow) is presented, pointing either to the left or to the right. 
This is followed by another 6 s for performing the imagination task (for 
further details see Blankertz et al. (2003)). The specific competition task is 
to provide an on-line discrimination between left and right movements for 
each of the 140 unlabeled single trials (STs). In particular, at every time 
instance in the interval from 3 to 9 seconds a decision and its confidence 
must be supplied. The objective of the competition was to detect the 
respective motor intention as early and as reliably as possible, and it 
therefore matches the setting of the proposed method perfectly. 



3.1 Neurophysiology 

The human perirolandic sensorimotor cortices show rhythmic macroscopic 
EEG oscillations (μ-rhythm) (Hari and Salmelin (1997)), with spectral peak 
energies around 10 Hz (localized predominantly over the postcentral 
somatosensory cortex) and 20 Hz (over the precentral motor cortex). 
Modulations of the μ-rhythm have been reported for different physiological 
manipulations, e.g., by motor activity, both actual and imagined (Jasper and 
Penfield (1949), Pfurtscheller and Aranibar (1979), Schnitzler et al. (1997)). 
Standard trial averages of μ-rhythm power show a sequence of attenuation, 
termed event-related desynchronization (ERD) (Pfurtscheller and Aranibar 
(1979)), followed by a rebound (event-related synchronization: ERS) which 
often overshoots the pre-event baseline level. Imaginary movements modulate 
the μ-rhythm more strongly on the hemisphere contralateral to the respective 
event than on the ipsilateral one (Pfurtscheller and Aranibar (1979), 
Schnitzler et al. (1997), Nikouline et al. (2000)), e.g. left imaginary 
movement causes stronger perturbations over the right motor cortex (C4) and 
vice versa (see Figure 2). 



3.2 Model 

In order to distinguish between STs of left and right hand imaginary 
movements, we utilize the accompanying EEG μ-rhythm perturbation. Similar 
approaches were pursued in Pfurtscheller et al. (1997) and Neuper et al. 
(1999). Since we assume that the mid-line channel Cz contains little 
discriminative information, we exclude it and restrict the analysis to C3 and 
C4. To extract the modulations in the two relevant frequency bands, we map 
the EEG to the time-frequency domain by means of Morlet wavelets (Torrence 
and Compo (1998)). Furthermore, we assume the existence of two 
distinguishable prototypical behaviors of modulation for the absolute 
amplitude of the μ-rhythm caused by either imaginary left or, respectively, 
right hand movements. Based on these physiological concepts we estimate two 
probabilistic models, one for each class of imaginary movement. For each 
class and at any time instance $t \in [0, 9]$ s we assume a 4-dimensional 
Gaussian distribution of the feature vectors $a(t)$ (the amplitudes of the 
two relevant frequency bands at the two electrodes), i.e., 

$$p(a(t) \mid y) = N(\mu_y(t), \Sigma_y(t)), \qquad (5)$$

where $\mu_y(t)$ and $\Sigma_y(t)$ are the individual means and covariance 
matrices of the two classes $y \in \{L, R\}$, which have been estimated in a 
robust manner. 

Fig. 2. The panels show the averaged event-related desynchronization (ERD) of the 
μ-rhythm at 10 Hz for the imagination of left (solid line) and right (dashed line) 
hand movement. The vertical line indicates the beginning of the imagination period. 
The μ-rhythm amplitude is attenuated in relation to the preceding baseline during 
the motor intention. This attenuation is prominent contralateral to the intended 
movement, i.e., for right hand movement over the left hemisphere (C3) and for 
left hand movement over the right hemisphere (C4). 



The instantaneous classification at a single time point is given by 

$$p(y \mid a(t)) = \frac{p(a(t) \mid y)}{p(a(t) \mid L) + p(a(t) \mid R)}. \qquad (6)$$




The temporal combination according to eq. (1) is realized by taking the 
expectation of the class probabilities from eq. (6) with respect to the 
discriminative power $g_t$ at each time point: 

$$p(y \mid a(t_0), \ldots, a(t_i)) = \frac{\sum_{t_0 \le t \le t_i} g_t \, p(y \mid a(t))}{\sum_{t_0 \le t \le t_i} g_t}. \qquad (7)$$




As described above (section 2), we measure the discriminative power 
through the Bayes error, which itself is bounded from above by the 
Chernoff bound, eq. (3). Using equal class prior probabilities, we finally 
define $g_t$ by 

$$2 g_t := 1 - \min_{0 \le \beta_t \le 1} \int p(a(t) \mid L)^{\beta_t} \, p(a(t) \mid R)^{1-\beta_t} \, da(t). \qquad (8)$$

The distributions of the feature vectors $a(t)$ are normal, therefore the 
minimization can be carried out easily. Fig. 3 shows the estimated Chernoff 
bound, given the labeled training data. Note that the most discriminative 
information occurs around 4.5 s, as indicated by the minimum of the error 
bound that corresponds to the maximum weight in the integration process. 

Due to the submission requirements of the competition, the final decision 
at time point $t_i$ is 

$$d_{t_i} = 0.5 - p(L \mid a_{t_0}, \ldots, a_{t_i}), \qquad (9)$$

where a positive or negative sign refers to right or left movements, 
respectively, while the magnitude indicates the confidence in the decision. 
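
The chain from eq. (6) to eq. (9) can be sketched as a single pass over the time points of one trial. The example is a simplification under stated assumptions: the class models are Gaussians with diagonal covariance (the paper uses full covariance matrices), each model is passed as a pair of per-dimension means and variances, and the weights are given. All names and the toy numbers are hypothetical.

```python
import math

def gauss_pdf(a, mu, var):
    """Density of a Gaussian with diagonal covariance at feature vector a."""
    p = 1.0
    for x, m, v in zip(a, mu, var):
        p *= math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
    return p

def posterior_left(a, model_L, model_R):
    """Eq. (6): p(L | a(t)) from the two class-conditional densities."""
    pL = gauss_pdf(a, *model_L)
    pR = gauss_pdf(a, *model_R)
    return pL / (pL + pR)

def online_decisions(features, models_L, models_R, weights):
    """Eqs. (7) and (9): weighted running posterior and signed decision."""
    num = den = 0.0
    out = []
    for a, mL, mR, g in zip(features, models_L, models_R, weights):
        num += g * posterior_left(a, mL, mR)
        den += g
        p_left = num / den if den > 0 else 0.5   # no evidence yet: undecided
        out.append(0.5 - p_left)   # positive: right, negative: left
    return out

# Toy one-dimensional models that stay the same at all three time points.
mL = ([-1.0], [1.0])   # (means, variances) for class L
mR = ([1.0], [1.0])    # (means, variances) for class R
left_trial = online_decisions([[-1.0], [-0.8], [-1.2]],
                              [mL] * 3, [mR] * 3, [0.1, 0.3, 0.2])
```

For a trial whose features lie near the mean of class L, every output is negative, and its magnitude never exceeds 0.5.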

Fig. 3. The left panel shows the time course of the classification error (thin solid), 
the Chernoff bound on the Bayes error (dashed) and the mutual information (thick 
solid). The right panel displays the time course of the mean and standard deviations 
of the decision according to eq. (9) for right (light grey) and left (dark grey) 
imaginary movements on the test data. 



3.3 Results 

After all model parameters had been estimated by means of a leave-one-out 
cross-validation optimization on the labeled training data, we applied 
the estimated model to the feature vectors of the unlabeled STs of the test 
data. The resulting time courses of both the model error of the binary 
classification and the mutual information (MI) (Schlögl et al. (2003)) on the 
previously unlabeled data are presented in Fig. 3. During the first four 
seconds the classification is at chance level; after four seconds a steep 
ascent in the classification accuracy can be observed in both the rising MI 
and the decreasing classification error. Although the Bayes error bound 
starts to increase gradually again after 4.5 s, indicating fading 
separability, the full model still gains information due to the integration 
process, so that at 6.8 s an overall minimum error of 10.7% is achieved. The 
MI maximum of 0.61 bit occurs at 7.6 s, indicating a peak decision confidence 
at this time. By showing the time courses of the class means and standard 
deviations of the decision, the right panel of Fig. 3 emphasizes the high 
discriminative ability of the proposed procedure: around 6 s there is no 
overlap between the class standard deviation tubes, reflecting the high 
confidence of the decisions. A comprehensive comparison of all techniques 
submitted to solve the specific task for data set III of the BCI competition 
is provided in 
(http://ida.first.fraunhofer.de/projects/bci/competition/. . . 

. . . results/index.html#graz). This evaluation reveals that the 
proposed algorithm outperforms all competing approaches, including 
traditional adaptive AR-parameter based methods. 